
1 Introduction

Once an adaptive immune receptor repertoire (AIRR) experiment has been carried out and the data has been appropriately preprocessed and annotated (see chapter “AIRR Community Guide to TR and IG Gene Annotation”), the next step is to plan a course of analysis to answer the questions posed by the experiment. As AIRRs are complex datasets that can contain thousands or even millions of sequences, it is important to have a working familiarity with the type of information each analysis can provide, as well as the limitations of an analysis. Here we provide an introduction to a variety of widely used techniques and discuss their applicability. In other chapters in this volume, we provide detailed experimental protocols and instructions to perform such analyses for the purpose of addressing specific biological questions. For a definition of terms used throughout this chapter, please see the AIRR Community glossary of terms, available at https://zenodo.org/record/5095381.

2 Materials

A breathtaking array of computational tools are available for repertoire analysis. These range from bespoke command line tools written in various programming languages that require facility in a Linux terminal to software with fully developed graphical interfaces and no requirement for programming skills of any kind. Thus, a key factor in choosing which programs to use will be the skill level and comfort of the user. Moreover, most tools have a narrow scope of the types of analysis they can perform, so matching the implementation to the desired goal is also a critical consideration. In addition, thought must be given to the computational resources necessary for repertoire analysis, including both storage and processing.

A comprehensive listing of the available software is out of the scope of this conceptual introduction, but the interested reader is directed to some recent reviews [1,2,3,4]. Here we focus on a small selection of commonly used tools, especially those which comply with AIRR Community guidelines for reproducibility and interoperability (https://docs.airr-community.org/en/stable/swtools/airr_swtools_standard.html). These are highlighted in Table 1, and several are discussed in more detail below and in other chapters in this volume, where we demonstrate their application to common analytical tasks.

Table 1 Software tools

3 Methods

In this section we introduce some of the most frequently used methods to analyze AIRRs and suggest computational tools that can perform such analyses. Some of the methods are applicable to both IG and TR, and some are specific to one or the other. In addition, the selection of the method and the interpretation of the results can depend on the specific biological state; for instance, some samples might be expanded from solid tumors, others from antigen-specific cells isolated from peripheral blood or from whole blood from healthy and diseased patients. The theoretical framework presented here can be used to interpret the results of the practical methods detailed in the AIRR Community chapters “Bulk gDNA Sequencing of Antibody Heavy Chain Gene Rearrangements for Detection and Analysis of B-Cell Clone Distribution,” “Bulk Sequencing From mRNA With UMI for Evaluation of B-Cell Isotype and Clonal Evolution and Single-Cell Analysis,” and “Tracking of Antigen-Specific T Cells: Integrating Paired-Chain AIRR-Seq and Transcriptome Sequencing,” all in this volume.

3.1 Gene Usage

The V gene is the most diverse gene of the TR and IG loci. This is driven especially by variation in the first and second complementarity-determining regions (CDR1 and CDR2) of the genes, which contribute to the specificity and affinity of the immune receptor. Differences in the distribution of V genes used in the rearranged repertoire can indicate an antigen-specific response or unusual clonal expansions and can be evaluated with the function compareVGeneDistributions of the sumrep R package [20] (https://github.com/matsengrp/sumrep). The D and J genes strongly contribute to the CDR3, and their usage can be compared using compareDGeneDistributions and compareJGeneDistributions. Skewing of the V-J usage can be revealed by plotting the V-J combinations as a heatmap (Fig. 1a). The distribution of V-J and V-D-J usage can be compared between two repertoires using the functions compareVJDistributions and compareVDJDistributions in sumrep.
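As a minimal illustration of how the V-J usage frequencies behind a heatmap such as Fig. 1a could be tabulated, the following Python sketch counts V-J pairs in a list of annotated rearrangements. It is not the sumrep implementation. The field names v_call and j_call follow the AIRR Community rearrangement schema; the function name and the toy records are assumptions made for illustration.

```python
from collections import Counter


def vj_usage(rearrangements):
    """Return {(v_gene, j_gene): fraction of rearrangements} for one repertoire."""
    counts = Counter((r["v_call"], r["j_call"]) for r in rearrangements)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Invented toy repertoire (hypothetical records, for illustration only).
sample = [
    {"v_call": "IGHV1-69", "j_call": "IGHJ4"},
    {"v_call": "IGHV1-69", "j_call": "IGHJ4"},
    {"v_call": "IGHV3-23", "j_call": "IGHJ4"},
    {"v_call": "IGHV3-23", "j_call": "IGHJ6"},
]
usage = vj_usage(sample)
```

The resulting dictionary can be reshaped into a V-by-J matrix for plotting; comparing two such matrices is the idea behind the repertoire comparisons mentioned above.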

Fig. 1 Data visualizations. Examples of different data visualizations to gain insights into the AIRR. The plot title describes the basic analysis. Smp = sample. For further details, please refer to the main text

3.2 Properties of the CDR3

The CDR3 is the most variable part of the rearranged IG/TR and is a key contributor to the overall specificity of the receptor [26]. Therefore, analyzing the properties of this region is of great interest.

Due to the randomness in the addition and deletion of nucleotides during the rearrangement of the receptor, CDR3 lengths will be distributed around a mean value (Fig. 1b). A marked deviation from this distribution can indicate an expansion of cells with a particular immune receptor.
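The CDR3 length distribution shown in Fig. 1b can be tabulated in a few lines. The sketch below is only an illustration in plain Python; the function name and the toy amino acid sequences are assumptions, not part of any tool discussed in this chapter.

```python
from collections import Counter


def cdr3_length_distribution(cdr3s):
    """Return {length: fraction of sequences}, ordered by increasing length."""
    counts = Counter(len(s) for s in cdr3s)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


# Invented toy CDR3 amino acid sequences (hypothetical).
toy_cdr3s = ["CASSLGF", "CASSLGGF", "CASSLAGF", "CASSLGGGF"]
length_dist = cdr3_length_distribution(toy_cdr3s)
```

Whether each distinct sequence is counted once or weighted by its copy number is a design choice that should be stated when such distributions are compared.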

Different receptors specific for the same epitope can be expected to share motifs [27, 28]. Such motifs can be a few identical amino acids or amino acids with similar physical properties. Apart from properties like size, charge, and polarity, the properties of amino acids can be described by different factors derived through dimensionality reduction of a larger number of properties. Atchley [29] factors comprise five numerical descriptions, and Kidera [30] factors comprise ten numerical descriptions.

The R package sumrep [20] provides functions to compare the CDR3 properties of two repertoires, such as the CDR3 length and a number of amino acid physicochemical properties [31, 32].
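To make the idea of a per-sequence physicochemical property concrete, the following sketch computes two simple descriptors of a CDR3 amino acid sequence: the mean Kyte-Doolittle hydropathy (often called GRAVY) and a crude net charge. This is an assumption-laden illustration and not the sumrep implementation; the charge calculation counts only K and R as positive and D and E as negative, ignoring histidine and pH effects.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(cdr3_aa):
    """Mean hydropathy of the sequence (higher values are more hydrophobic)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in cdr3_aa) / len(cdr3_aa)


def net_charge(cdr3_aa):
    """Crude net charge: (K + R) minus (D + E); a simplifying assumption."""
    positive = sum(cdr3_aa.count(aa) for aa in "KR")
    negative = sum(cdr3_aa.count(aa) for aa in "DE")
    return positive - negative


# Invented toy CDR3 (hypothetical).
example_gravy = gravy("CARDK")
example_charge = net_charge("CARDK")
```

Distributions of such values across all CDR3s of a repertoire can then be compared between samples, which is the kind of comparison the summary-statistic approach above relies on.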

3.3 Clonal Lineages

A clone or a clonal lineage comprises a group of T or B cells descended from the same original naive ancestor. As such, all cells in a clone contain the same set of rearrangements. An important part of AIRR-seq analysis is computationally reconstructing these relationships from the sequences obtained. For TRs, the exercise is relatively straightforward, as only PCR and sequencing error need to be accounted for. With IGs, however, somatic hypermutation can significantly obscure the ancestry of a particular sequence [33], and so more complex strategies are required (see Subheading 3.9.2).

When analyzing bulk AIRR-seq data, in which native pairing between heavy and light, alpha and beta, and gamma and delta chains is lost, clones are sometimes defined based on a single chain. This may be sufficient for IGH and TRB rearrangements, which are more diverse and contain most of the information needed to group sequences into clonal lineages [34]. However, care should still be taken in interpreting such data. Many different definitions of clonally related sequences have been offered in the literature (e.g., see the work by Kotouza and co-workers [35]), and methods to infer clones from AIRR-seq data are under active investigation [6, 12, 14, 18, 36].

The distribution of clone sizes in an AIRR can be informative of underlying biology. One visualization is to plot clone ranks from high to low on the x-axis and the associated frequencies on the y-axis (Fig. 1c) to reveal clonal expansion. A closer look at the top x clones (Fig. 1d) likewise helps to identify clonal expansion. When plotting the log of rank and the frequency (Fig. 1e), the slope reveals the distribution of clones, such that the steeper the slope, the less evenly distributed the repertoire. The function estimateAbundance in the R package Alakazam [6] can estimate clonal abundance with confidence intervals obtained by bootstrapping.
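The rank-frequency view can be sketched as follows. The code ranks clones by frequency and fits a least-squares slope; it assumes that both rank and frequency are log-transformed, which is one possible reading of the plot described above, and it does not reproduce the bootstrapped estimates of estimateAbundance. All function names are hypothetical.

```python
import math


def rank_frequency(clone_counts):
    """Return [(rank, frequency)] with rank 1 assigned to the largest clone."""
    total = sum(clone_counts)
    freqs = sorted((c / total for c in clone_counts), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(ranked):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    xs = [math.log10(rank) for rank, _ in ranked]
    ys = [math.log10(freq) for _, freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented clone sizes: an even repertoire and a strongly expanded one.
even = rank_frequency([10, 10, 10, 10])
skewed = rank_frequency([1000, 100, 10, 1])
```

In this toy example the even repertoire gives a slope of approximately zero, while the expanded repertoire gives a clearly negative slope.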

An alternative visualization divides the clonal frequencies into different groups (“binning”) and sums the frequencies in each bin (Fig. 1f). The binning is essentially arbitrary, but binning the clone frequencies into the bins [0.0, 0.001], [0.001, 0.01], [0.01, 0.1], and [0.1, 1] is widely used. Binning by rank is an alternative, where the bins [1, 10], [11, 100], [101, 1000], and [1001, inf] are common.
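A minimal sketch of frequency binning is given below. Because the bins listed above share their boundaries, the code assumes that each bin includes its upper edge and excludes its lower edge; this convention, like the function name, is an assumption made for illustration.

```python
def binned_clonal_space(clone_counts, upper_edges=(0.001, 0.01, 0.1, 1.0)):
    """Sum clone frequencies per bin; bin i covers (previous edge, upper_edges[i]]."""
    total = sum(clone_counts)
    occupied = [0.0] * len(upper_edges)
    for count in clone_counts:
        freq = count / total
        for i, upper in enumerate(upper_edges):
            if freq <= upper:
                occupied[i] += freq
                break
    return occupied


# Invented clone sizes (hypothetical): three large clones and one smaller clone.
space = binned_clonal_space([500, 300, 150, 50])
```

The returned values sum to one, so they can be drawn directly as a stacked bar per sample.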

3.4 Diversity

The concept of diversity unites two properties of a repertoire, namely, the number of distinct clones and their distribution. As such, diversity describes the composition and state of a repertoire. For instance, a repertoire derived from a completely naive cell population is much more diverse both in terms of distinct clones and their distribution compared to the repertoire of antigen-specific memory cells.

There are numerous sampling factors that are important to consider when measuring diversity. Perhaps the most important is whether a sample is derived from gDNA or mRNA [37]. As discussed in Subheading 3.1 in chapter “AIRR Community Guide to TR and IG Gene Annotation,” in the case of gDNA, each sampled cell contributes one or two templates, while the number of templates in mRNA data will be skewed by cell subset-specific transcript abundance. In the case of the former, diversity measures will be influenced substantially less by the underlying subset distribution than the latter. For both, one can measure diversity weighted by copy number or by clone number. For DNA data, using copy number-weighted diversity measures can give a sense of how similar sequences are in the underlying repertoire while using an unweighted measure will indicate how similar clones are. With RNA, using copy number-weighted measures will give a general measure of how similar large clones are, and unweighted measures will give a measure of how similar all clones are.

Another consideration when analyzing diversity is the depth of sequencing, that is, the proportion of clones that were sequenced compared to how many were actually in the sample. Assessing appropriate sequencing depth is no trivial task but is very important, as undersampling can lead to false conclusions. Rarefaction curves [38] can help to evaluate whether a repertoire is near full sampling depth. In this visualization, the number of distinct clones is plotted for a given subsample size (Fig. 1g). If the number of distinct clones plateaus, the repertoire is near full sampling depth. Conversely, the absence of a plateau is an indication that the sampling depth of the repertoire is shallow.
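One way to obtain the points of a rarefaction curve without repeated random subsampling is the classical analytical expectation for sampling without replacement. The sketch below assumes this formulation and is not taken from any particular repertoire tool: for a sample of N sequences in which clone i has N_i copies, the expected number of distinct clones in a subsample of size n is the sum over clones of 1 - C(N - N_i, n) / C(N, n).

```python
from math import comb


def expected_distinct_clones(clone_counts, n):
    """Expected number of distinct clones in a subsample of n sequences."""
    total = sum(clone_counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the sample size")
    denominator = comb(total, n)
    # comb(total - c, n) is 0 when n exceeds total - c, i.e. the clone cannot be missed.
    return sum(1 - comb(total - c, n) / denominator for c in clone_counts)


# Invented clone sizes (hypothetical).
counts = [5, 3, 2]
curve = [expected_distinct_clones(counts, n) for n in range(sum(counts) + 1)]
```

Plotting the curve against subsample size gives the visualization described above; a curve that is still rising at the full sample size suggests shallow sampling.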

Another use of rarefaction is an estimation of the total number of clones from the sample. To achieve this, libraries from the sample of interest must be run in replicates, where more replicates give a more accurate estimate of total clones [39].

There are a large number of diversity metrics. These metrics are unified by Hill numbers, which are calculated over a range of diversity orders to generate a smooth curve (Fig. 1h) [40,41,42]. The function calcDiversity in the R package Alakazam estimates the Hill numbers for a repertoire. The same function also makes calculation of particular diversity indices straightforward. The function compareHillNumbers in the R package sumrep compares one or more Hill numbers of two repertoires. Newer approaches toward diversity metrics specific for AIRR make use of Hill numbers combined with a functional similarity matrix [43].
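For reference, the Hill number of order q for clone frequencies p_i is (sum of p_i^q)^(1/(1-q)), with the limit exp(-sum of p_i ln p_i) at q = 1. Order 0 gives the number of distinct clones, order 1 the exponential of the Shannon entropy, and order 2 the inverse Simpson index. The sketch below is a plain Python illustration of this definition and does not reproduce the resampling performed by Alakazam; it assumes the input frequencies sum to one.

```python
import math


def hill_number(freqs, q):
    """Hill number (effective number of clones) of order q."""
    p = [f for f in freqs if f > 0]
    if q == 1:
        # Limit of the general formula as q approaches 1.
        return math.exp(-sum(f * math.log(f) for f in p))
    return sum(f ** q for f in p) ** (1 / (1 - q))


# Invented frequency vectors (hypothetical): an even and an expanded repertoire.
even_freqs = [0.25, 0.25, 0.25, 0.25]
skewed_freqs = [0.97, 0.01, 0.01, 0.01]
profile = [hill_number(skewed_freqs, q) for q in (0, 1, 2)]
```

For the even repertoire every order returns four effective clones, whereas for the expanded repertoire the value drops as q increases, because higher orders give more weight to large clones.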

3.5 Similarity of AIRR Sequences

The similarity of AIRR sequences directly influences antigen recognition breadth: the more dissimilar the receptors are, the larger is the antigen space covered. One major approach to interrogate and measure AIRR sequence similarity is network analysis (Fig. 1i) [44,45,46,47,48,49,50]. Networks allow investigation of sequence similarity and thereby add a complementary layer of information to repertoire diversity analysis. Sequence networks are built by defining each nucleotide or amino acid sequence as a node. Two nodes are connected with an edge if a certain similarity condition is satisfied, which is typically defined as a string distance (e.g., Levenshtein/edit distance). A commonly used distance for both IG and TR is one amino acid difference [44]. For B cells, networks representing amino acid distances of up to 12 amino acids have been reported [47]. Building a sequence similarity network is computationally expensive. This challenge has been approached by at least two methods that allow the construction of large-scale networks from millions of AIRR sequences [47, 51].
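The construction rule described above can be sketched with a naive all-against-all comparison. This is only an illustration of the node and edge definitions; it scales quadratically with the number of sequences and is not one of the large-scale methods cited above. The function names and toy sequences are assumptions.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def build_network(sequences, max_distance=1):
    """Nodes are unique sequences; edges join pairs within max_distance edits."""
    nodes = sorted(set(sequences))
    edges = [(a, b) for a, b in combinations(nodes, 2)
             if levenshtein(a, b) <= max_distance]
    return nodes, edges


# Invented CDR3 amino acid sequences (hypothetical).
nodes, edges = build_network(["CASSLGF", "CASSPGF", "CARDYW", "CASSLGF"])
```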

Although networks of a few thousand nodes may be visualized using software suites such as igraph, Cytoscape, and Gephi [52, 53], the visual interpretation of networks becomes indiscernible at sizes of >10^2 nodes. Furthermore, the visualization of networks does not provide quantitative information regarding the network similarity architecture. To address this problem, graph properties and network analysis have recently been employed to quantify the architecture of large-scale AIRR networks [47]. Architecture analytics may be subdivided into properties that capture the repertoire at the global level (generally one coefficient per network) and those that describe the repertoire at the local level (one coefficient per sequence per repertoire). These network measures may be used to identify enrichment of network clusters (Fig. 1i), potentially originating from an ongoing immune response [46, 47].
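The distinction between local and global properties can be illustrated with two simple measures: the degree of each node (one value per sequence) and the fraction of nodes in the largest connected component (one value per network). These two measures are chosen here only as examples and are not claimed to be the specific coefficients used in the cited studies; names and toy data are hypothetical.

```python
from collections import defaultdict


def node_degrees(nodes, edges):
    """Local property: number of neighbors of every node."""
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def largest_component_fraction(nodes, edges):
    """Global property: share of nodes in the largest connected component."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, largest = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for other in neighbors[node] - seen:
                seen.add(other)
                stack.append(other)
        largest = max(largest, size)
    return largest / len(nodes)


# Invented toy network (hypothetical node labels).
toy_nodes = ["A", "B", "C", "D"]
toy_edges = [("A", "B"), ("B", "C")]
toy_degrees = node_degrees(toy_nodes, toy_edges)
toy_fraction = largest_component_fraction(toy_nodes, toy_edges)
```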

To increase precision in isolating immune-associated AIRR sequences and clusters, network analysis may be coupled with AIRR generation probabilities [45]. More generally, it has been observed that sequences that tend to show increased sharing across individuals (discussed in Subheading 3.7) are also more connected within a repertoire [45, 47, 48] and confer robustness on its architecture with respect to network properties [47].

Recently, sequence similarity and diversity analysis have been combined, providing further insights into AIRR architecture [43].

3.6 Similarity among Repertoires

Similarity indices measure the similarity of two populations by considering not only the number of shared clones but also the clone counts or frequencies (Fig. 1j). Similarity is sometimes calculated as dissimilarity (for historical reasons), but the index is always in the range of [0, 1]. It is therefore important to indicate the meaning of 0 and 1 to avoid confusion. One of the most popular indices is called Morisita-Horn, implemented in the function vegdist in the R package vegan [54]. Numerically, the observed overlaps are usually small; however, given the size of the potential repertoire being sampled, the a priori chance of any overlap is very small, so even a small overlap can be informative. Alternatively, the CDR3s shared between samples can be plotted as a true/false heatmap (Fig. 1k). This is particularly useful when tracking clones over time or assessing the specificity of transplant infiltrating cells [55, 56].
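The Morisita-Horn index can be written out directly. In the sketch below it is expressed as a similarity, so that 1 means identical clone frequency profiles and 0 means no shared clones; this orientation is a choice made for the example and may differ from the convention of a given software package. The function name and toy counts are hypothetical.

```python
def morisita_horn(counts_x, counts_y):
    """Morisita-Horn similarity between two {clone: count} dictionaries."""
    total_x = sum(counts_x.values())
    total_y = sum(counts_y.values())
    shared = set(counts_x) & set(counts_y)
    cross = sum(counts_x[c] * counts_y[c] for c in shared)
    simpson_x = sum(v * v for v in counts_x.values()) / total_x ** 2
    simpson_y = sum(v * v for v in counts_y.values()) / total_y ** 2
    return 2 * cross / ((simpson_x + simpson_y) * total_x * total_y)


# Invented clone counts keyed by CDR3 (hypothetical).
sample_a = {"CASSLGF": 5, "CASSPGF": 3}
sample_b = {"CASSLGF": 50, "CASSPGF": 30}   # same proportions, deeper sampling
sample_c = {"CARDYW": 7}                    # no clones in common with sample_a
```

Because the index is based on proportions, sample_a and sample_b are scored as identical even though their sequencing depths differ by a factor of ten.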

Similarities on other parameters such as different amino acid properties as well as pairwise CDR3 distance and GC content can be compared between repertoires by the function compareRepertoires in the R package sumrep.

Other proposed similarity measures make use of feature counting [57], while another B-cell-specific similarity metric focuses on identical CDR3 length together with identical V and J genes considered within and between repertoires [58].

3.7 Public Clones

Though not clones in a true biological sense, the existence of identical TRs and identical or closely similar IGs in multiple individuals due to convergent rearrangement has been noted on several occasions [59,60,61]. Such rearrangements are termed public clones and can yield insights into common selection patterns, which in turn can elucidate how the immune system responds to disease and if there are commonalities between individuals. The ability to identify public clones in an AIRR depends on the sequencing depth and the number of individuals tested [62, 63]. In addition, the meaning of a public immune receptor must be assessed in the context of the likelihood for it to be generated [8, 13]. Receptors with shorter CDR3s are more likely to be generated by chance and can overlap even between individuals with no exposures in common [60, 64, 65] and do not necessarily indicate a convergent response in multiple individuals to similar antigens. Sequences that share the same (preferably longer) CDR3 amino acid sequence but have different nucleotide sequences are more convincing as candidate public clones, as differences in the nucleotide sequences may indicate independent generation with convergent selection [66].
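A simple screen for candidate public clones along these lines is sketched below: CDR3 amino acid sequences found in at least two individuals are reported together with the number of distinct nucleotide sequences that encode them. The data layout, the function name, and the toy sequences are assumptions; the sketch does not account for generation probability, sequencing depth, or contamination.

```python
from collections import defaultdict


def candidate_public_cdr3s(repertoires, min_subjects=2):
    """repertoires: {subject: [(cdr3_aa, cdr3_nt), ...]}."""
    subjects = defaultdict(set)
    nucleotide_variants = defaultdict(set)
    for subject, records in repertoires.items():
        for cdr3_aa, cdr3_nt in records:
            subjects[cdr3_aa].add(subject)
            nucleotide_variants[cdr3_aa].add(cdr3_nt)
    return {
        cdr3_aa: {
            "subjects": sorted(found_in),
            "n_nt_variants": len(nucleotide_variants[cdr3_aa]),
        }
        for cdr3_aa, found_in in subjects.items()
        if len(found_in) >= min_subjects
    }


# Invented toy data (hypothetical): CASSF is encoded differently in each subject.
toy_repertoires = {
    "subject_1": [("CASSF", "TGTGCCAGCAGCTTT"), ("CASRF", "TGTGCCAGCAGATTT")],
    "subject_2": [("CASSF", "TGCGCTAGCAGCTTC")],
}
public = candidate_public_cdr3s(toy_repertoires)
```

A shared amino acid sequence with more than one nucleotide variant is the more convincing case discussed above, although a CDR3 as short as the ones in this toy example would be expected to arise by chance.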

Functionally identical IGs can be identified by allowing some degree of difference in the CDR3. There is no well-defined cutoff to ensure the capture of a majority of receptors with identical specificities without including IGs of unrelated specificity into a particular collection of public IGs. A commonly used cutoff is 10–20% amino acid difference in the CDR3 [67,68,69,70]. Although a less restrictive cutoff might detect more divergent public clones [71], care must be taken to avoid identification of spurious public immune receptors [72]. Cross-contamination and index hopping on the sequencer further complicate the identification of public clones [73], and clearly stated definitions and analysis parameters may help to guard against such artifacts.
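How such a cutoff might be applied is sketched below. The example assumes that only CDR3s of identical length are compared and that the difference is measured as the fraction of mismatched positions; both choices, like the default cutoff of 20%, are assumptions for illustration rather than a recommended definition.

```python
def cdr3_aa_difference(cdr3_a, cdr3_b):
    """Fraction of mismatched positions, or None if the lengths differ."""
    if len(cdr3_a) != len(cdr3_b):
        return None
    mismatches = sum(x != y for x, y in zip(cdr3_a, cdr3_b))
    return mismatches / len(cdr3_a)


def within_cutoff(cdr3_a, cdr3_b, cutoff=0.2):
    """True if both CDR3s have equal length and differ by at most the cutoff."""
    difference = cdr3_aa_difference(cdr3_a, cdr3_b)
    return difference is not None and difference <= cutoff


# Invented ten-residue sequences (hypothetical).
one_mismatch = within_cutoff("CARDYWGQGT", "CARDFWGQGT")      # 10% difference
three_mismatches = within_cutoff("CARDYWGQGT", "CAKEFWGQGT")  # 30% difference
```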

3.8 Detection and Monitoring of Cross-Sample Contamination Events

Despite strict quality assurance and control measures, PCR-based sample cross-contamination can occur at any time. Environmental contamination events are expected to arise from the presence of remaining DNA amplicons, which can be re-amplified and incorporated into new, unrelated libraries [74]. PCR contaminations can lead to major losses of reagents, time, and samples, and rapid detection and isolation are critical to the health of an AIRR-seq research laboratory. There are several experimental precautions that can reduce contamination, including separate work areas and different sample barcodes, as illustrated in the AIRR Community chapter “Quality Control: Chain Pairing Precision and Monitoring of Cross-Sample Contamination.”

3.9 B-Cell-Specific Aspects

3.9.1 IG SHM Analysis

SHM is the process driving the affinity maturation of IGs during the adaptive immune response [75]. Mutations are introduced at a rate of ~10^-3 mutations per base pair per division. These mutations are not randomly distributed along the IG but accumulate more in hotspots and CDRs, whereas coldspots and framework regions are disfavored for mutation. Furthermore, substitution profiles may be germline gene-directed [76,77,78,79], possibly as a consequence of specific features of the encoded protein sequence. Understanding SHM biases is key to developing better tools to reconstruct lineages, quantify selection pressure, and generate realistic simulated sequence data [9, 79, 80].

To better understand the distribution of targets for SHM, it is possible, for instance, to use the R package sumrep, which provides two functions, getHotspotCountDistribution and getColdspotCountDistribution, to compute the distribution of the hot- and coldspot motifs in the repertoire. In addition, sumrep interfaces with the R package SHazaM [6], which calculates a mutability model for the likelihood for the center base in a 5-mer to be mutated (the function getMutabilityModel). The associated function getSubstitutionModel provides the relative probabilities that the center base in a 5-mer is mutated into each of the other three nucleotides. SHazaM also provides methods for quantification of selection pressure and whether it has contributed to the nature of the specific IG repertoire during antigenic stimulation [81].
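Counting motif occurrences in a sequence can be sketched with regular expressions. The example assumes the commonly cited WRC/GYW hotspot and SYC/GRS coldspot motifs (W = A or T, R = A or G, Y = C or T, S = C or G); the motif sets actually used by sumrep may differ, and the function name is hypothetical.

```python
import re

# Lookahead patterns so that overlapping motif occurrences are all counted.
HOTSPOT = re.compile(r"(?=[AT][AG]C|G[CT][AT])")   # WRC or its reverse complement GYW
COLDSPOT = re.compile(r"(?=[CG][CT]C|G[AG][CG])")  # SYC or its reverse complement GRS


def count_motifs(sequence, pattern):
    """Number of positions at which the motif pattern starts."""
    return sum(1 for _ in pattern.finditer(sequence.upper()))


# Invented toy nucleotide sequence (hypothetical).
toy_sequence = "AGCTGCC"
n_hot = count_motifs(toy_sequence, HOTSPOT)
n_cold = count_motifs(toy_sequence, COLDSPOT)
```

Applying such a count to every sequence of a repertoire yields a distribution of motif counts that can be compared between repertoires.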

3.9.2 Identification of B-Cell Clones

As noted above, B-cell clones can be inferred from AIRR-seq data by analyzing their CDR3s and/or mutation patterns (Fig. 1l). Repertoires usually consist of hundreds or thousands of clonal lineages. Due to the presence of SHM, members of a B-cell clone cannot be identified solely based on identical CDR3s. There are many methods available to group IGs into clonal lineages (Table 1), but all generally attempt to computationally group sequences which likely share a common progenitor. However, different approaches can drastically change the interpretation of the underlying IG immune repertoire.

Some approaches begin by grouping sequences by their CDR3 independent of their V, D, or J gene usage [22]. Other software first groups sequences by gene (generally just V and J due to the difficulty in D gene annotation) and CDR3 length after which sequences similar in the CDR3 are grouped into clonal lineages [12, 19, 82, 83]. SCOPer does a similar grouping, but then evaluates the similarity by analyzing shared SHM in the V and J genes [84]. Finally, some pipelines use common mutations in the body of the V gene to group sequences from the same clonal lineage [36, 85]. It is also possible to combine these approaches, but this section focuses on each independently.

Each approach has potential benefits and flaws. Initially grouping sequences by CDR3, either by identity or hierarchical clustering, can result in inflated copy number and sequence counts for common CDR3s (in particular those of short length that incorporate few non-templated bases) which may have arisen independently and utilize different genes. However, this method can be beneficial as some gene calls may be incorrect (in particular when annotation of sequences has not been made using a personalized repertoire as defined above), and similar CDR3 amino-acid sequences, especially those with long lengths, can indicate that sequences are related.

Grouping sequences by both gene annotation and CDR3 length prior to inferring clonal lineages can be beneficial for a number of reasons. Because V gene annotation is generally robust to sequencing error, sequences with similar CDR3s but different V gene assignments are unlikely to derive from the same rearrangement. Binning by gene annotation can therefore prevent erroneous clonal groupings. It also eases the computational burden, as CDR3 identity only needs calculation among smaller sets of sequences. Similar advantages apply to binning by CDR3 length as well, since distance metrics can be calculated more efficiently without the need for alignment. While insertions and deletions can occur as part of SHM, they are relatively rare [86, 87] and can be neglected in many cases.

Once sequences have been binned, hierarchical clustering is a common technique for identifying clonally related sequences [82]. This requires a choice of linkage (e.g., single, average) to define the distance between groups of sequences and a threshold for cutting the hierarchy into discrete groups. A convenient way to set the threshold is to analyze the distribution of distances between nearest neighbors. This distribution is typically bimodal, with the first mode representing sequences in the same clonal lineage, while the second mode represents sequences that do not have any relatives in the data. If the distribution for a particular sample is not bimodal, a set of external sequences from a different subject can be used to establish the threshold [82]. While the threshold for separating the two modes can sometimes be established by visual inspection of the distribution, there are algorithmic methods to determine it more consistently [18].
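The two steps described above can be sketched for a single bin of sequences that already share V gene, J gene, and junction length. The code computes the nearest-neighbor distance of every sequence and then performs single-linkage clustering at a fixed threshold. It uses the length-normalized Hamming distance as an assumed metric and a threshold chosen by hand; it is not the implementation of any tool listed in Table 1.

```python
def normalized_hamming(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences in one bin must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_neighbor_distances(seqs):
    """Distance from each sequence to its closest other sequence (needs >= 2)."""
    return [
        min(normalized_hamming(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]


def single_linkage_clusters(seqs, threshold):
    """Join sequences connected by a chain of distances <= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if normalized_hamming(seqs[i], seqs[j]) <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i, seq in enumerate(seqs):
        groups.setdefault(find(i), []).append(seq)
    return sorted(sorted(group) for group in groups.values())


# Invented 12-nucleotide junctions from one hypothetical bin.
junctions = ["TGTGCGAGAGAT", "TGTGCGAGAGAC", "TGTGCGAGGGAC", "TGTACCTCTTGG"]
nn_distances = nearest_neighbor_distances(junctions)
clones = single_linkage_clusters(junctions, threshold=0.1)
```

In this toy bin the first three junctions form one clone through a chain of single differences, while the fourth has a large nearest-neighbor distance and remains a singleton, mirroring the two modes of the distribution described above.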

The last common approach is to group sequences into clones by common mutations in the body of the V gene. This can be done by constructing clonal lineages directly or by inspecting the k-mers of each sequence [36, 88]. Unlike methods that first separate sequences by gene call and junction length, this method takes advantage of infrequent mutations to group sequences into clones. This can be beneficial in certain circumstances. First, this method does not rely on proper gene calling or sequence alignment, which can be difficult in samples containing highly mutated populations or more generally due to sequencing error. Additionally, it is not sensitive to junction length, allowing sequences that have accumulated insertions and deletions to be grouped into clones [89, 90]. This method requires defining the minimum number of shared mutations needed to group two sequences into the same clone. A fixed value can be used, or the value can be dynamically determined based on the distribution of distances between each pair of sequences.

3.9.3 IG Affinity Maturation

The reconstruction and analysis of IG clonal lineage trees is a powerful method to understand the immune response, affinity maturation, and the generation of broadly neutralizing antibodies (bnAbs) [91,92,93]. Within a B-cell clonal lineage, B cells descended from a shared common ancestor evolve through SHM and antigen-driven selection. While standard algorithms for inferring phylogenetic trees using maximum parsimony and maximum likelihood [94] are often employed, these approaches can be improved [80]. In particular, the unique biology of B cells can present problems for standard phylogenetic approaches and has led to the development of B-cell-specific phylogenetic tools. One cause of the problems is that SHM is enzymatically driven and biased by hotspot and coldspot motifs. This violates the assumption of independent evolution among sites that many likelihood-based phylogenetics methods rely on. To address this challenge, more context-aware phylogenetic methods, such as IgPhyML [9, 10], have been developed. While context-aware models of SHM clearly improve estimates of phylogenetic model parameters used to detect antigen-driven selection [10], it is less clear how much they improve estimates of tree topology and branch lengths [95]. Another problem is that while standard phylogenetic models consider clonal lineages individually, IG repertoires often contain hundreds of independent clones. The use of repertoire-wide models, which allow some parameters to be shared among these multiple clonal lineages, can improve model precision significantly [10]. One important application of B-cell phylogenetics is estimating the series of mutations leading from a clone’s unmutated germline ancestor to a sequence of interest, such as a known bnAb sequence. While standard phylogenetic methods can reconstruct intermediate sequences, they are less appropriate for reconstructing the germline ancestral sequence because they do not take into account the biology of V(D)J rearrangement. This has led to the development of tools such as Clonalyst and linearham [96, 97] that improve the reconstruction of these sequences by combining phylogenetic models with models of V(D)J rearrangement. Another feature of B-cell clonal lineages is that reconstructed intermediate sequences are often identical to observed IG sequences. Some tools, such as IgTree [98] and Alakazam [6], use this fact to simplify the visualization of these lineage trees by collapsing observed and sampled intermediate nodes. Finally, lineage trees containing B cells from multiple tissues, isotypes, and timepoints have the potential to be used to make inferences about how B-cell migration, isotype switching, and evolution over time occur. Multiple analyses have used lineage trees for this purpose [33, 40, 99, 100], and generalized tools for making these inferences from B-cell repertoires, such as Dowser and PopTree, are an area of active development [7].

3.10 T-Cell-Specific Aspects

There is growing evidence that TR repertoire perturbations can serve as a biomarker of immune response toward some solid tumors [101,102,103] and pathogens such as Epstein-Barr virus (EBV), cytomegalovirus (CMV), Ebola, and SARS-CoV-2 [104,105,106,107,108]. Challenges with studying T-cell repertoires include the dependence of T-cell interactions on the major histocompatibility complex (MHC) [109], MHC-dependent changes in TRBV usage, and significant differences in TRBV usage and clonality between CD4+ and CD8+ repertoires [110,111,112].

Antigen-specific TCRs can be isolated either by sorting of MHC-tetramer-positive cells or by sorting of activated cells after stimulation with overlapping peptide pools. Staining with tetramers requires knowledge of the correct epitope in the right MHC context, and T cells with high affinity tend to be recovered with the highest efficiency. Therefore, tetramer staining sometimes fails to identify some of the relevant TCRs [113]. Stimulation with overlapping peptide pools, on the other hand, can lead to isolation of non-peptide-specific T cells due to bystander activation [114]. The TRs of the antigen-enriched cells can be compared to samples from different timepoints to track the frequency of clones of interest [104, 106].
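Tracking clones of interest across timepoints amounts to looking up their frequency in each bulk sample. The sketch below assumes that clones are identified by exact CDR3 amino acid identity, which is a simplification, and uses hypothetical names and toy counts.

```python
def clone_frequencies_over_time(timepoints, clones_of_interest):
    """timepoints: {label: {cdr3: count}}; returns {cdr3: {label: frequency}}."""
    tracked = {}
    for clone in clones_of_interest:
        tracked[clone] = {}
        for label, counts in timepoints.items():
            total = sum(counts.values())
            tracked[clone][label] = counts.get(clone, 0) / total if total else 0.0
    return tracked


# Invented toy data (hypothetical): one clone expands between two timepoints.
bulk_samples = {
    "day0": {"CASSLGF": 1, "CASSPGF": 99},
    "day14": {"CASSLGF": 40, "CASSPGF": 60},
}
antigen_enriched = ["CASSLGF", "CASSQGF"]
trajectories = clone_frequencies_over_time(bulk_samples, antigen_enriched)
```

A clone that is absent from a timepoint is reported with a frequency of zero, which should be interpreted in light of the sequencing depth of that sample.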

4 Conclusion

In this chapter, we have provided a brief overview of diverse, widely used techniques to uncover biological information in AIRR-seq data. These techniques can be applied to all of the AIRR-seq data created using the methodologies described in this book. They further form the basis for selecting the optimal experimental protocol to address the biological question and choosing the computational methods used in the analysis.