Introduction

The host immune system responds to diverse foreign antigens, partly through recombination of variable, diversity, and joining (V, D and J) chromosomal segments within T-cell receptor (TCR) α, β, \(\gamma\), and \(\delta\) loci in thymocytes. Random nucleotide insertion and deletion in CDR3 (complementarity-determining region 3) segment junctions provide theoretical yields of over 1015 different human TCR-αβ receptors [1] dispersed among approximately 1011 naïve T-cells circulating at any one time [2].

When T-cells are sampled from blood, most TCR sequences are observed only once, although some are found multiply in part due to identical recombination events occurring in the thymus [3] and in part due to clonal expansion of T-cells whose TCR binds to an antigen-bound major histocompatibility complex protein. Clonal expansion of T-cells occurs in disease states [4,5,6,7,8,9] and in response to infection, for example with Epstein-Barr virus [10]. A minority (~ 10%) of individuals with acute viral or bacterial infection experience a prolonged illness lasting 12 months [11, 12]. Many people with myalgic encephalomyelitis (ME) report an onset of symptoms after infection [13,14,15]. ME is a multi-systemic disabling disease that results in a health-related quality of life worse than 20 other conditions including chronic inflammatory disorders, cancer and multiple sclerosis [16, 17]. A quarter of people with ME report being house- or bed-bound [18]. ME is not rare: it affects, for example, between 836,000 and 2.5 million Americans [16] and an estimated 0.2–0.4% of the UK population [19, 20]. Despite its high prevalence and burden of illness, no reliable biomarker or diagnostic test exists and its aetiology is unknown.

Given that its onset is frequently ascribed to infection and hypotheses that it may be an autoimmune condition [16] we set out to determine whether blood samples from people with ME show an expansion of TCR clonotypes compared with healthy or disease controls.

Results

Samples and entropy metrics

Forty human peripheral blood mononuclear cell (PBMC) samples were received from each of four groups: (i) Severely affected people with ME (either house- or bed-bound; MEsa); (ii) Mildly or moderately affected people with ME (MEmm); (iii) people diagnosed with Multiple Sclerosis (MS; disease controls); and, (iv) Healthy controls (HC). Samples were sourced from the CureME Biobank [21] from female donors aged 40–60 years, chosen to reduce possible age- or sex-dependent confounding effects, although their limited availability among MEsa required us to source samples from younger donors (Table 1).

Table 1 PBMC samples

Six CD8+ samples were discarded: one had insufficient enriched cells, five libraries had been cross-contaminated. Among CD4+ samples, 12 were discarded: two had insufficient enriched cells, four had insufficient DNA, and six failed sequence library synthesis, leaving 154 CD8+ samples and 148 CD4+ samples. Analysis of variance (ANOVA) for the experimental parameters (Table 2) showed that the only association with sample group was CD8+ cell number\((p=0.03\)). This was not significant after accounting for multiple tests.

Table 2 Mean values of cell count and purity, and numbers of CD8+ or CD4+ samples

Human herpesvirus infection may be a trigger for ME [22], and Cytomegalovirus (CMV) also affects T-cell clonal diversity [23, 24]. CMV infection is very common in the UK [25] although CMV seropositivity in this study, available for 98 of the 160 samples, was not significantly associated with sample group status (\(p=0.58\), χ2-test; Table 3).

Table 3 Cytomegalovirus seropositivity/seronegativity results, shown by group. Equivocal samples were those with indeterminate IgG levels

We chose Rényi entropy as our T-cell receptor diversity metric (Additional file 1). For each cell type and α/β/γ-chain combination we defined a vector of clonotype counts in a sample. Here, clonotype is defined as CDR3, plus the full V, D and J gene segments without considering α−β or γ−δ chain pairing. Note that \(\delta\)-chain data is excluded, because recombination of the \(\alpha\) locus, which occurs first, preferentially removes the TCR-δ locus [26]. Next, we constructed a matrix of Euclidean distances between vector pairs, ensuring that each pair contained equivalent numbers of recombinant rearrangements by randomly down-sampling the more-populous sample (Additional file 1), a necessity as TCR rearrangement counts varied over two orders of magnitude among samples. Distances were calculated over a pre-set optimised range of α, the order of Rényi entropy. Next, cell type and α/β/γ-chain combinations were partitioned by clonotype, adapting a machine learning approach [27] (Additional file 1). Once distances were precomputed, investigators were unblinded to the group identity (e.g. MEmm) of each CD8+ sample. CD4+ data were acquired following unblinding.

Multidimensional scaling (MDS) plots

The distance matrix for all three CD8+ chain types (TCR α-, β-, and γ-chains) considered together was visualised using a MDS plot (Fig. 1). The four groups (MEsa, MEmm, MS and HC) are not clearly separated in this plot’s two dimensions. Location of seven MEsa, two MEmm and three MS disease cases away from the main cluster (\(x>2000\)) are likely due to age, rather than disease status, because among 21 variables (Table 4), only age (binned by decade) was strongly negatively and significantly correlated with x-axis values (\(p=2\times {10}^{-3}\)). This likely reflects a narrowing of immune repertoire diversity due to age-associated reduced thymic activity (thymopoiesis) [28], which also explains our finding that age group is a significant predictor (\(p = 8\times {10}^{-3}\)) of CD8+ TREC count. Four of seven of the MEsa outliers (Fig. 1) were from individuals aged 18-29 years at donation.

Table 4 Variables tested as potential confounders with MDS plot x-axis using a generalised linear model fit

MDS y-axis values were negatively correlated with β (TRB) locus coding junction count (\(p=1\times {10}^{-3}\); Table 5) motivating us to perform similar analyses that separated CD8+ rearrangements by each of the three loci: TCR α-, β- or γ-loci. Again, no clear separation was visible in two dimensions (Fig. 2A,B,C), or when a third dimension was added, or when different α-ranges and step size were used (Fig. 2D,E). Finally, we undertook this analysis for CD4+ (helper T) cells, separating by TCR α-, β-, or γ-chain. Although some structure was apparent, again we observed no clear separation between groups in two dimensions (Fig. 3).

Table 5 Variables tested as potential confounders with MDS plot y-axis using a generalised linear model fit

Potential support vector machine (P-SVM)

We next assessed if clonotypes of one group (e.g. ME cases) clustered separately from another (e.g. healthy controls) using the Potential Support Vector Machine (P-SVM) approach [29] with leave-one-out cross-validation. Classification boundaries between groups were defined by the P-SVM and classifications were tested against random expectation using permutation testing (up to \({10}^{3}\) label shuffles; Additional file 1; Fig. 4).

Our three primary aims were to test for a difference in the expansion of T-cell clonotypes inferred from TCR sequencing between: (i) ME cases (MEsa + MEmm) and healthy controls, or (ii) ME cases and MS controls, or (iii) MEsa and MEmm cases. Recognition that each hypothesis was tested six times (both CD4+ and CD8+ cells; and each of α-, β- or γ-chains) resulted in a Bonferroni multiple testing correction [30] of \(p<\frac{0.05}{18}\approx 0.0028\) for the 18 primary tests. Our six secondary aims were to test for T-cell clonotype expansion differences between: healthy controls and (a) ME or (b) MEmm or (c) MEsa cases; or between MS cases and (d) healthy controls or (e) MEmm or (f) MEsa cases for each of the two cell types and three TCR chains, i.e. \(6\times 2\times 3=36\) tests. Accounting for all 54 tests, our Bonferroni-corrected threshold for testing secondary aims was \(p<\frac{0.05}{54}\approx 9.3\times {10}^{-4}\).

Comparison of group classifications against random permutations yielded single-test p-values, each representing the likelihood that two groups had been separated by the P-SVM algorithm by chance. Four comparisons achieved significance at \(p<0.05\) (Table 6). Nevertheless, after appropriate Bonferroni correction no comparison remained significant.

We conclude that none of the null hypotheses that we investigated should be rejected.

Table 6 Uncorrected p-values from permutation testing for CD4+ and CD8+ cells, by α/β/γ chain type. p-values \(<0.05\) are indicated in bold. The † symbol indicates that fewer than 1,000 permutations were used to generate the p-value due to computational resource constraints

Discussion

We applied a method that determines statistical significance of the difference in TCR clonotype repertoires between cases and controls. With simulated data, the approach perfectly distinguished sample groups at a 1% p-value threshold (Additional file 1). Nevertheless, with experimentally-determined data the method detected no statistically significant differences between TCR clonotype diversities of people with ME and others with MS or healthy control individuals, or between the TCR clonotypes of severely affected versus mildly- or moderately-affected people with ME (MEsa and MEmm). Despite this study comparing a relatively large number of samples from people with ME (\(n=80\)) with those from healthy controls (\(n=40\)) or from disease controls (MS; \(n=40\)), and despite a large number of rearrangements (\(\sim{10}^{3}-{10}^{4}\)) being sequenced from each sample, no differences between groups were detected.

Future work could consider refining the analytical pipeline using samples from a disease with a more clearly established incidence of clonal expansion such as T-cell leukaemia/lymphoma [31]. A future study might apply the Berger-Parker index which is weighted towards greater α-values; in our study low α-values gave unstable estimates. It is likely that all current TCR repertoire studies are severely underpowered at present, due to current technology greatly undersampling the true T-cell repertoire [32].

Limitations

First, given its diverse initiating onsets and symptoms, ME may yet be found to have heterogeneous biomarkers, each predictive of only some patients. Without such biomarkers, research studies of this size may lack adequate power to identify molecular signatures, such as TCR repertoires, that are predictive of only a minority of patients. Discovery of biomarkers or clinical tests that aid ME diagnosis remains an urgent priority for future research. Second, a larger study restricted to individuals with particular HLA alleles may have greater predictive power, because HLA risk alleles modulate autoimmunity risk by increasing the frequency of autoreactive TCRs [33].

Third, even if ME is aetiologically homogeneous, our negative results could reflect a lack of statistical power to identify true differences. If sampling noise in our experiment’s clonotype counts obscured a clear signal, rearrangement numbers may need to be increased by several orders of magnitude, from \({10}^{4}- {10}^{5}\) sequences per donor investigated here, towards the \(\sim{10}^{11}\) naïve T-cells in human repertoires [2]. Additional predictive power could be gained by investigating the pairing of TCR chains (e.g. α−β) in single cells, although this would be at the inevitable expense of greater experimental cost.

Finally, this study’s results could also reflect an absence of clonotype diversity differences between groups. If so, then the TCR clonotype sequences themselves, rather than their diversity, could be predictive of disease status, or else TCR repertoire differences are manifest not in blood, but in other more disease-relevant tissues as in MS [34]. Our negative results could also reflect causal mechanisms of ME that do not result in T-cell repertoire change.

Materials and methods

Donor sample cell enrichment

160 samples of peripheral blood mononuclear cells (PBMC) were acquired from the CureME Biobank [21] from the same number of anonymised donors. A unique pseudo-anonymised identifier was assigned to each sample to allow the study to be blinded. Samples were stored at -180oC. CureME defines ME samples according to either or both of the Canadian Consensus and Fukuda criteria [21]. Four groups of 40 samples were provided: (i) Severely affected people with ME (either house- or bed-bound; MEsa); (ii) Mildly or moderately-affected people with ME (MEmm); (iii) people diagnosed with Multiple Sclerosis (MS; disease controls); and, (iv) Healthy controls (HC). As much as possible, these groups were age-matched, with the majority of donors being between 40 and 60 years old at the time of sample collection; all were from female donors (Table 1). Samples’ CMV seropositivity status was provided by the CureME Biobank based on the level of CMV immunoglobulin G (IgG) measured in their plasma [21]. Data generation and analysis protocols were developed and optimised using two CD8+ samples from healthy controls (American Type Culture Collection, Manassas, VA, USA).

PBMCs were enriched by Magnetic Activated Cell Sorting [35] using MACS Micro beads (Miltenyi Biotec) first to positively select CD8+ (cytotoxic) cells (CD8 MicroBeads, human 130-045-201), then CD4+ (helper/regulatory) cells (CD4 MicroBeads, human 130-045-101). After each selection step, material was retained for verification by Flow Cytometry staining using fluorescent conjugated antibodies specific for enriched populations (Miltenyi Biotec MACS 130-113-125 CD3 Antibody, anti-human, APC Clone: BW264/56, MACS 130-113-254 CD4 Antibody, anti-human, PE Clone: M-T466, MACS 130-113-157 CD8 Antibody, anti-human, FITC Clone: BW135/80). Enrichment results were visualised with plots of co-receptor expression using manually set gates to define discrete cell populations (Additional file 5). A minor batch effect was noted, but variability was more pronounced between individuals than between batches, so was considered to have minimal impact.

Sequencing library synthesis

Samples with > 1 × 105 enriched (CD8+ or CD4+) cells were lysed and DNA extracted using a QIAmp DNA Micro Kit (Qiagen). Samples returning > 1 µg DNA were fragmented by sonication with a BioRuptor (Diagenode) using 30 rounds at 20 s intervals, size-separated using gel electrophoresis and fragments between 250 and 350 base pairs excised and recovered using a Clean Gel DNA Recovery kit (Zymo). Yield was measured using a Quant-IT Picogreen DNA assay kit (Thermo Fisher Scientific). Fragments were polished to produce blunt ends using a NextNEB End Repair kit (New England Biolabs).

Custom designed UMI adapters were ligated to polished gDNA fragments (Integrated DNA Technologies). dsDNA adapted design incorporated, in order from the ligatable blunt end: (i) a 4 base-pair (bp) validation barcode, (ii) a 6 bp random Unique Molecular Identifier (UMI), (iii) a 4 bp library ID barcode, (iv) a 2 bp AcuI target cleavage site, (v) a 14 bp filler sequence, (vi) a 6 bp AcuI binding site, (vii) a 5 thymine, 2 uridine, 5 thymine single base run forming a closed hairpin cap preventing adapter concatemerisation. Following adapter ligation (New England Biolabs), Axyprep Solid Phase Reversible Immobilisation (SPRI) beads were used to size select and clean-up the DNA (Corning). The hair-pinned adapter ends were then opened with a NEB USER kit (NEB) to enable five rounds of enrichment PCR with a Kappa Hifi Polymerase \(2\times\) Mastermix kit (Roche). Amplicon yields were measured with PicoGreen (Thermo Fisher Scientific) and re-size selected by gel electrophoresis.

Target enrichment and sequencing

A target enrichment strategy was used to select library fragments containing signal and coding recombinations of the human T-cell receptor loci on chromosomes 14q11 (TRAD, α/δ locus), 7q34 (TRB, β locus) and 7p14 (TRG, γ locus). Biotin-conjugated 120mer ssDNA SureSelect enrichment baits were custom designed (Agilent SureSelect) to give \(2\times\) tiling depth across a 500 bp region centred on each predicted recombination signal sequence (RSS) breakpoint motif (IMGT/GENE DB v3.1.30, https://www.imgt.org; [36]). Baits for V regions (1,680) and J regions (996) were synthesised as separate libraries. Each library also contained 17 common baits designed across a 1 kb section of the 14q11 α/δ locus TRAC constant region with \(2\times\) tiling depth for process quality control and assessment of input library coverage (see Additional file 2). The bait capture library selecting for fragments with J region homology was employed initially to enrich samples batched into 1.5 µg pools of five previously UMI indexed sample libraries, 300ng per donor, using manufacturer’s standard protocols (Agilent). The output of each J region selection was amplified with five rounds of PCR using Kappa Hifi Polymerase \(2\times\) Mastermix (Roche). To enrich for recombinant library fragments containing both V and J elements this product was then used as input for a second round of selection with the V region bait library using the manufacturer’s standard protocols. Final capture products were amplified with ten rounds of PCR using Kappa Hifi Polymerase \(2\times\) Mastermix (Roche).

Enriched libraries were prepared for Illumina NovaSeq sequencing. AcuI (NEB) was first used to digest the post-selection amplification products to remove the majority of the custom adapter whilst retaining the 4 bp validation sequence, 6 bp UMI, 4 bp library ID barcode and a predictable, 2 bp “sticky” overhang. Custom designed barcoded Illumina compatible adapters with complementary 2 bp overhangs were then ligated to each library pool (IDT). Final library pools were enriched using five cycles of PCR, size selected by gel electrophoresis and sequenced on an Illumina NovaSeq platform using a \(2\times\) 250 bp PE sequencing protocol (Arizona Genetics Core).

Data analysis

A custom suite of Java programmes was used for library demultiplexing and assigning clonotypes. Read pairs were quality and length filtered (all bases with Phred \(>20\), >250 bp) then demultiplexed using both Illumina pool indexes and library specific barcodes to identify high quality donor-specific subsets of reads. Successfully validated 6 bp UMI sequences were identified from each read in a pair, concatenated to form a fragment specific 12 bp UMI sequence, recorded in the ID field of each read along with barcoding information. These sequences were then trimmed from the fragment ends and reads collapsed to make a library of \(1\times\) 250 bp SE QC filtered, UMI annotated and trimmed reads for each donor.

A bioinformatics library of 20 base unique sequence tags was curated to unambiguously identify each V, D or J element of the human T-cell receptor loci on chromosomes 14q11 (TRAD, α/δ locus), 7q34 (TRB, β locus) and 7p14 (TRG, γ locus). Tags were positioned 15 bp from the predicted element RSS site and reference sequences recorded (Additional file 3). For tags predicted to span known single nucleotide variants multiple alternative versions were produced to permit allelic discrimination. Where local homology prevented unique tag production, non-unique tags (rtags) shared by multiple elements were generated. Secondary discriminating tags (alt-tags) located at the nearest point of divergence were then used to differentiate between rtag elements during processing. Separate libraries of tags were produced for ‘coding’ and ‘signal’ flanks of each RSS. Libraries of donor specific \(1\times\) 250 bp SE reads were scanned bioinformatically to identify reads with homology to tags in pairwise conformations compatible with VDJ signal or coding recombination. Parallel comparison of VDJ junction identification was conducted using PEAR [37] pre-assembly of overlapping 250 bp PE reads to reconstitute, where possible, the full-length sequenced fragment.

CDR3 regions were defined for such reads as nucleotide sequences located between the predicted position of identified V, D or J element RSS cleavage sites or where homology to the expected reference approaching the RSS site diverged. Individual V-CDR3-J cassettes were determined to be productive or non-productive by scanning for a single ORF containing the known coding frames of each V or J element identified. Predicted in-frame stop codons were also noted and only productive rearrangements were analysed further. Each unique V-CDR3-J sequence was assumed to represent a discrete T-cell clonotype and sequencing reads with identical V-CDR3-J recombination were clustered within each donor library. PCR derived duplicates were identified and removed from these clusters using degenerate matching of fragment 12 bp UMI sequences, allowing for up to two base mismatches. Clonotype duplicates within a donor library with unique UMI sequences were assumed to represent T-cell clonal expansions. The number of unique clonotypes and extent of their expansion was then scored for each donor library to give an index of T-cell diversity.

Generalized linear models (GLMs)

To investigate MDS axes we implemented a generalized linear model using glm() in R (see Additional file 1).

Fig. 1
figure 1

Multidimensional scaling (MDS) plot of all CD8 + chains. Samples from MEsa are indicated in red, MEmm in yellow, MS in green and HC in blue. Axes indicate projected distances, such that the linear distance between two samples reflects the value in the distance matrix. There is no clear visual separation between the groups in two dimensions, but there are notable disease case (MEsa, MEmm and MS) outliers for \(x>\text{2,000}\). The plot was visualised using cmdscale() in R. MDS plots are invariant under rotation and reflection

Fig. 2
figure 2

Multidimensional scaling plots for the three TCR chain types of CD8 + cells. Panels A, B, C: Data from α, β and γ chains are shown. Samples from MEsa are indicated in red, MEmm in yellow, MS in green and HC in blue. No clear separation by group is visible in two dimensions. Panels D and E: MDS plots for α-chain CD8+ data showing changes to the distance matrix caused by adjusting the α-step size (Panel D, \(\alpha\)-range of \(\left[\text{0,20}\right]\) and a step size of 0.1) and \(\alpha\)-range (Panel E, \(\alpha\)-range of \([0,,10]\) and a step size of 0.2, as in [27])

Fig. 3
figure 3

Multidimensional scaling plots for the three TCR chain types of CD4 + cells. α, β and γ chain data analyses are shown clockwise from the top left. Samples from MEsa are indicated in red, MEmm in yellow, MS in green and HC in blue. No clear separation by group is visible in two dimensions

Fig. 4
figure 4

Analysis pipeline for the computational implementation of the TCR clonotype diversity analysis