1 Introduction

Once an Adaptive Immune Receptor Repertoire sequencing (AIRR-seq; see the AIRR Community glossary at https://doi.org/10.5281/zenodo.5095381 for definitions of key terms) experiment has been successfully designed and carried out (see the discussion in Chap. 15), attention turns to analyzing the data collected to produce biological insights. Many of the same factors that influenced choices in experimental design will be important in planning the computational approach as well. AIRR-seq data to be analyzed may have been generated from genomic DNA or mRNA, with or without unique molecular identifiers (UMIs), and in a bulk or single-cell context, as described in Chap. 15. Each of these alternatives may require (or preclude) the use of certain software tools and influence the interpretation of the analysis. In addition, thought must be given to what computational and storage resources will be necessary given the size of the dataset and the intended analysis.

A clear first decision point in AIRR-seq data analysis is whether IG or TR repertoires are being analyzed (Fig. 1). While many tools such as MiXCR [1], IMGT [2], and others (Table 1) can handle both types of data, some are specific to one or the other. In addition, interest in specialized inquiries like phylogenetic analysis of IGs or calculation of clonal dynamics may require additional specific tools. In such a case, it may be useful to work within a particular ecosystem like Immcantation (http://immcantation.org), VDJServer [18], or SONAR [12], which provide several tools for a thorough analysis from quality control to clonal analysis, to facilitate smooth workflows.

Fig. 1

AIRR-seq decision points. The different ways an AIRR-seq experiment can be constructed. Each choice has implications both for the experimental methodology and for the design of an appropriate analysis strategy

Table 1 Software tools

The most critical set of considerations revolves around the origins of the molecules that were actually loaded into the sequencer (see Chap. 15). They may have been initially amplified from genomic DNA or from mRNA; the former results in exactly one initial copy of each productive V(D)J rearrangement in a cell, while the latter starts with several or many copies, a number that may vary with cell type and activation state. When amplifying mRNA, the initial molecules may also be labeled with UMIs, which enable the correction of errors introduced by PCR and/or sequencing by identifying reads that are derived from the same original molecule. Of note, while UMIs enable experimental error correction, their use necessitates a considerably larger sequencing depth because several reads are needed to build each consensus (for a more nuanced discussion, see, e.g., [20, 21]). UMIs may also be used when sequencing DNA, but that is currently less common in practice. UMIs can also be used to improve quantification, by collapsing apparent expansions due to differential amplification. Some specialized UMI protocols may also require particular matched software tools to fully utilize the advantages of those schemes [22]. Without UMIs, it is advisable to cluster highly similar reads to avoid overcounting, particularly for IG sequences, where errors and somatic hypermutation (SHM) are often indistinguishable.

It is also important to think about how molecules from the full repertoire are included in the pool to be amplified for sequencing. For mRNA-derived libraries, in particular, the efficiency of cDNA generation can be a significant bottleneck and may vary depending on the enzymes and protocol used in the reverse transcription (RT) reaction [23, 24]. Low RT efficiency can lead to a bias toward abundant species in the repertoire and concomitant dropout of rare ones. In addition, because of the diversity of V and J genes and their surrounding genetic context, many protocols use pools of primers to capture the full repertoire [25]. However, these primers may have different efficiencies in amplifying their respective targets, and some genes might be targeted by more than one primer in a pool. Other protocols circumvent this problem by adding 5′ anchors during reverse transcription [26]. In addition, IGs with high SHM can lose their ability to bind to an intended primer, resulting in the depletion of these sequences from the measured repertoire.

Recently, several high-throughput technologies have become popular for conducting AIRR-seq at single-cell resolution. These provide the most accurate, direct measurements of repertoire statistics and allow more biologically accurate definitions of clones. To do so, however, requires analysis tools that are capable of keeping heavy/light, alpha/beta, or gamma/delta chain sequences linked. The AIRR Community [27] (https://www.antibodysociety.org/the-airr-community/) is developing standardized representations for “receptors” and “cells” to facilitate these analyses and ensure data portability. In addition, single-cell IG and TR data can be easily linked to transcriptomic and other measurements for more comprehensive analyses.

The sequencing technology used must also be taken into account. Illumina paired-end sequencing requires an additional preprocessing step to reassemble the amplicon, and this may result in a bias against longer sequences, with less overlap between the two reads. Meanwhile, more error-prone long-read technologies require extra attention to quality control.

This chapter aims to guide bioinformaticians through the first steps in repertoire analysis, specifically the considerations and preparation of raw data for subsequent repertoire analysis (see Chap. 17). Firstly, this chapter provides in-depth information on the materials necessary to conduct the analysis, including computational resources for data preparation, available software tools, and germline database information (Fig. 2). The main portion of the chapter then discusses the considerations on data preprocessing and annotation of raw sequences with a reference germline database.

Fig. 2

Process overview. Conceptual steps in designing an AIRR-seq analysis, proceeding from raw inputs to annotated sequences for downstream analysis

2 Materials

2.1 Computing Resources

AIRR-seq data are usually large and require specialized analysis methods and software tools. A typical Illumina MiSeq sequencing run generates 20–30 million 2 × 300 bp paired-end sequence reads, which roughly corresponds to 15 GB of sequence data to be processed. Other platforms like NextSeq, which is useful in projects where the full V gene is not needed, create about 400 million 2 × 150 bp paired-end reads. Because of the size of the datasets, the analysis can be computationally expensive, particularly the early analysis steps like preprocessing and gene annotation that process the majority of the sequence data. A standard desktop PC may take 3–5 days of constant processing for a single MiSeq run, so dedicated high-performance computational resources may be required. The researcher's institution may provide a cluster with high-performance computers for running analysis jobs. Commercial services like Amazon Web Services or Google Cloud can also provide access to compute resources; however, this may come at added cost and could raise privacy concerns. Alternatively, there are free computing resources available. For AIRR-seq data, VDJServer provides free access to high-performance computing at the Texas Advanced Computing Center (TACC) through a graphical user interface [18]. VDJServer has also parallelized execution for tools such as IgBLAST, so more compute resources are utilized as the size of the input data grows. Analysis that takes days on a desktop PC might take only a few hours on VDJServer. An example workflow is provided in the AIRR Community Chap. 22 with instructions about using VDJServer for immune repertoire analysis.
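As a rough check on the data volume quoted above, the arithmetic can be sketched as follows. The choice of 25 million read pairs (the midpoint of the quoted range) and the reading of "15 GB" as roughly 15 billion sequenced bases are assumptions made for illustration only.

```python
# Back-of-the-envelope data volume for one 2 x 300 bp paired-end run.
# Assumption: 25 million read pairs, the midpoint of the 20-30 million range.
read_pairs = 25_000_000
reads_per_pair = 2
read_length_bp = 300

total_bases = read_pairs * reads_per_pair * read_length_bp
gigabases = total_bases / 1e9

# An uncompressed FASTQ record stores one quality character per base, so the
# file holds at least two bytes per base (read headers add more on top).
min_fastq_gigabytes = 2 * total_bases / 1e9

print(f"{gigabases:.0f} gigabases; at least {min_fastq_gigabytes:.0f} GB as uncompressed FASTQ")
```

Under these assumptions, storage planning should allow for noticeably more than the raw base count, since intermediate files from each preprocessing step are usually kept alongside the raw reads.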

2.2 Software Tools

Many tools are available for the first steps in AIRR-seq analysis [28,29,30,31]. Table 1 highlights several of the more commonly used programs. These are noted particularly because they support standardized AIRR data representations and are mostly free and open source, two key criteria among the AIRR software guidelines (https://docs.airr-community.org/en/stable/swtools/airr_swtools_standard.html). When deciding which software tools to use, besides the computational requirements and the expertise of the user, we recommend taking into consideration whether the tools use the AIRR Community standards and are AIRR-compliant. Tools that use the standard can easily be incorporated into complex workflows with other tools that share the same data format. Selecting AIRR-compliant software adds a layer of transparency to the analysis, because compliant software must, among other quality requirements, (1) make its source code available for inspection in a public repository, (2) use a versioning system, (3) have been tested, and (4) be available as a container (Docker, Singularity). The use of AIRR standards and of AIRR-compliant software supports the transparency, reproducibility, and rigor of research results.

2.3 Germline Databases

IG and TR germline databases are a requirement for accurate AIRR-seq analyses, regardless of the technique used (e.g., single cell vs. bulk). These databases guide the assignment of sequences to known and novel IG and TR genes/alleles, facilitating downstream sequence annotation and the accurate assessment of various repertoire features (e.g., gene/allele usage, SHM, clonal assignment, etc.; see AIRR Community Chaps. 18–20 for more detail). A germline database should ideally contain the most comprehensive and accurate set of possible IG/TR V, D, and J genes and alleles that best represent the genomic content of an organism. There are various sources of reference germline databases available, and occasionally a tool restricts which database can be used for a particular analysis. Thus, the use of a particular database, or a combination of databases, may vary depending on the experimental objectives, as well as the particular species in which the AIRR-seq data have been generated. We therefore recommend investing effort in obtaining as accurate a database as possible. Table 2 describes currently available databases, focusing on those that are in active development.

Table 2 Germline reference databases

IMGT [2] provides the most commonly used reference genome databases, but even for species of substantial research interest, these do not represent species diversity and can contain sequences reported in error [35, 36]. For TR genes and for IG genes from nonhuman species, however, few or no satisfactory alternatives exist. Ongoing initiatives seek to remedy this by continuously improving germline databases across species. Several programs are available to infer personalized databases from AIRR-seq data for each experimental subject (Table 1). VDJbase (https://www.vdjbase.org) is a resource that brings together AIRR-seq and genomic information to study population diversity and identify previously unreported alleles [34]. In 2019, the AIRR Community established the IARC (Inferred Allele Review Committee) to evaluate, document, and name human IGH alleles inferred from AIRR-seq data [37], and it is anticipated that this approach will be extended to other species and loci over time. The IARC’s work is supported and published by OGRDB (the Open Germline Receptor Database, https://ogrdb.airr-community.org), which provides full information regarding alleles, metadata on the repertoires from which they originated, and ref. 32.

3 Methods

Preprocessing and gene annotation of AIRR-seq data take the sequencing files as input and return a set of high-quality sequences for which V, D, and J allele calls can be made and structural elements can be identified. After further quality control filtering steps, a final set of sequences is selected and can be used to carry out more in-depth analyses (see Chap. 17). All steps should be carefully documented to maintain data provenance and allow the analysis to be reproduced; the AIRR Community has defined a set of MiAIRR data processing fields to standardize the representation of analysis steps [38]. Below, we outline the concepts involved in each phase of analysis and then supply detailed protocols, applying them to common use cases. We also provide further information on reporting and sharing AIRR-seq data.

3.1 Preprocessing

While there are several experimental technologies available for AIRR-seq studies from different experimental setups, most approaches produce the same raw data file format (.fastq) and share the ultimate goal of obtaining a final set of high-quality reads, particularly in the complementarity-determining region 3 (CDR3), representative of each B or T cell in the repertoire. The general steps that need to be performed include (1) filtering reads (e.g., removing PhiX spike-ins, short reads, and reads with a low Phred score or excessive ambiguous base calls), (2) identifying and removing primers and sequencing barcodes (if present), (3) building consensus sequences (using UMI or cell barcodes, if present), (4) merging mate pairs (if using a paired-end protocol), (5) masking low-quality positions, (6) annotating with constant (C) region (if present), and (7) collapsing duplicate sequences. Some of these steps need to be adjusted depending on whether the data are from genomic DNA or RNA, B cells or T cells, bulk or single cell, paired or unpaired chains, and whether UMIs have been used (Fig. 1).
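Step (1) above can be illustrated with a minimal sketch that filters reads on length, mean Phred score, and number of ambiguous base calls. The thresholds, the function names, and the assumption of Phred+33 encoded, four-line FASTQ records are illustrative choices, not the defaults of any particular tool.

```python
from io import StringIO
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    seq: str
    qual: str  # quality string, assumed to be Phred+33 encoded


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield Read(header[1:], seq, qual)


def mean_phred(qual: str, offset: int = 33) -> float:
    """Average Phred score of a non-empty quality string."""
    return sum(ord(char) - offset for char in qual) / len(qual)


def passes_filter(read: Read, min_length: int = 100,
                  min_mean_quality: float = 20.0,
                  max_ambiguous: int = 10) -> bool:
    """Apply hypothetical length, ambiguity, and quality thresholds."""
    if not read.seq or len(read.seq) < min_length:
        return False
    if read.seq.upper().count("N") > max_ambiguous:
        return False
    return mean_phred(read.qual) >= min_mean_quality


example = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n@r3\nANNT\n+\nIIII\n"
reads = list(parse_fastq(StringIO(example)))
kept = [r.name for r in reads
        if passes_filter(r, min_length=4, min_mean_quality=20, max_ambiguous=1)]
print(kept)
```

In this toy input, the second read fails on mean quality and the third on ambiguous base calls. Production pipelines would normally rely on an established tool for this step and also handle compressed input and malformed records.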

In the following we describe the important considerations to be made when preprocessing AIRR-seq samples.

3.1.1 Filtering by Sequence or by Clone

Current NGS methods introduce occasional base-call errors which may not be detectable from the associated quality scores. A common approach to avoid incorporating these sequences in downstream analyses is to threshold data based on the frequency of reads. This does not eliminate such errors but can reduce their influence on gross metrics of the underlying immune repertoire. To remove spurious sequences, a common approach taken, e.g., by MiXCR [1] and SONAR [12], is to collapse identical or near-identical sequences and drop those with fewer than a specified number of reads (usually two or three). This approach is preferred where individual sequences may be of low quality, for instance, if sequencing depth is low. However, this approach to filtering can result in nonuniform loss of data when libraries of different sequencing depths are compared. Alternatively, instead of a preprocessing step, all sequences passing quality control checks can be grouped into clones using the regular workflows described in the AIRR Community method Chaps. 18 and 19, and then clones that include fewer than the specified number of unique sequences are removed prior to downstream analysis. This may be appropriate for high-quality sequences, such as with UMIs and sufficient sequencing depth for robust error correction. Without this correction, errors in the CDR3 can lead to the inference of spurious clones.
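The sequence-level filter described above can be sketched as follows. This minimal version collapses only exactly identical sequences; the clustering of near-identical sequences performed by real tools is not reproduced here, and the function name and default threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def collapse_and_threshold(sequences: Iterable[str],
                           min_count: int = 2) -> Dict[str, int]:
    """Collapse identical sequences and keep those seen at least min_count times."""
    counts = Counter(seq.upper() for seq in sequences)
    return {seq: n for seq, n in counts.items() if n >= min_count}


observed = ["ACGT", "acgt", "ACGA", "ACGT", "TTTT", "TTTT"]
retained = collapse_and_threshold(observed, min_count=2)
print(retained)
```

Because the threshold is an absolute read count, the same sequence can pass in a deeply sequenced library and fail in a shallow one, which is the source of the nonuniform data loss noted above.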

3.1.2 Read Length-Related Effects

Long paired-end reads provide useful information for reliable V gene assignment as well as more comprehensive mapping of SHM in the case of IG gene rearrangements [39]. As read length increases, the quality of base calls degrades toward the end of each read, but paired-end sequencing allows for computational alignment of the overlapping regions. After alignment, sequencing errors at the ends of the sequences can be reduced, as the higher-quality base call can be used at each overlapping position. However, for longer sequences such as with RNA libraries capturing the constant region, the read length on the sequencer may need to be increased, reducing the overlapping portion of the 5′ and 3′ reads, resulting in a bias against sequences encoding longer CDR3. Further complicating this issue, a common procedure is to trim low-quality stretches of base calls from the ends of reads, using generic tools like fastx-toolkit or pRESTO’s FilterSeq trimqual [4]. This can in turn reduce the number of full-length high-quality sequences. On the other hand, with RNA-based sequencing, UMIs can be incorporated at the cDNA synthesis step, and, when coupled with very deep sequencing, these can be used for error correction through the construction of consensus sequences from reads that share the same UMI. There is, however, a trade-off between the sequencing depth required for adequate coverage of UMIs and the number of independent sequences that can be sampled.
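The idea of merging mate pairs and keeping the higher-quality base call in the overlap can be sketched as follows. This is a deliberately simple, ungapped approach that accepts the longest overlap within a mismatch tolerance; established merging tools score candidate overlaps statistically. The function names and default thresholds are assumptions for illustration.

```python
from typing import Optional, Tuple

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd_seq: str, fwd_qual: str, rev_seq: str, rev_qual: str,
               min_overlap: int = 10,
               max_mismatch_rate: float = 0.1) -> Optional[Tuple[str, str]]:
    """Merge a read pair into (sequence, quality), or return None if no overlap fits."""
    # Put the reverse read on the same strand and orientation as the forward read.
    rc_seq = reverse_complement(rev_seq)
    rc_qual = rev_qual[::-1]
    # Try the longest candidate overlap first, without allowing gaps.
    for overlap in range(min(len(fwd_seq), len(rc_seq)), min_overlap - 1, -1):
        start = len(fwd_seq) - overlap
        mismatches = sum(a != b for a, b in zip(fwd_seq[start:], rc_seq[:overlap]))
        if mismatches > max_mismatch_rate * overlap:
            continue
        seq = list(fwd_seq[:start])
        qual = list(fwd_qual[:start])
        for i in range(overlap):
            # Keep whichever base call has the higher quality character
            # (ASCII order matches Phred order within one encoding).
            if rc_qual[i] > fwd_qual[start + i]:
                seq.append(rc_seq[i])
                qual.append(rc_qual[i])
            else:
                seq.append(fwd_seq[start + i])
                qual.append(fwd_qual[start + i])
        seq.extend(rc_seq[overlap:])
        qual.extend(rc_qual[overlap:])
        return "".join(seq), "".join(qual)
    return None


# Toy amplicon: the forward read ends with a low-quality miscalled base.
amplicon = "ACGTTGCAAGGCTTACCGATAGGCTTAACCGGTTAGCATCG"
forward = amplicon[:27] + "T"
forward_quality = "I" * 27 + "#"
reverse = reverse_complement(amplicon[16:])
reverse_quality = "I" * len(reverse)
merged = merge_pair(forward, forward_quality, reverse, reverse_quality)
```

In the toy example the miscalled base at the end of the forward read is replaced by the higher-quality call from the reverse read. When the amplicon is longer than the combined read length minus `min_overlap`, the function returns `None`, which is how the bias against longer sequences arises.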

Long reads covering the entire variable region can also be generated using alternative sequencing platforms, such as those offered by Pacific Biosciences and Ion Torrent [31, 40,41,42,43]. These offer the additional advantage of being able to capture large enough parts of the C-region to be able to distinguish between subtypes of IgG. However, lower throughput on these platforms limits the depth of sampling that can be achieved.

Short reads are sometimes used to generate large quantities of data on CDR3 sequences, as sequencing short reads can be done on higher-throughput sequencers at lower cost. This strategy is particularly common for TR rearrangement analysis on gDNA using commercial platforms such as Adaptive. Short reads may be required if the template is of low quality, as sometimes occurs in formalin-fixed paraffin-embedded samples. Short reads can sometimes compromise TRBV gene assignments but are particularly problematic for IGH gene rearrangements with SHM. Short IGHV gene sequences result in larger numbers of ambiguous V gene assignments which can cause erroneous clustering of unrelated sequences into clones.

gDNA vs. mRNA templates. When using genomic DNA as starting material, each cell contributes a fixed number of IG or TR templates, providing a parsimonious and cost-effective means of profiling large numbers of cells. gDNA-based sequencing will also capture far more nonproductive gene rearrangements than mRNA-based sequencing. With RNA, transcripts of nonproductive rearrangements are subject to nonsense-mediated decay (although some nonproductive rearrangements can be recovered). gDNA is also more stable than RNA. On the other hand, RNA-based sequencing is more sensitive, with more templates per cell. With mRNA-based sequencing, cells contribute different numbers of templates, based upon cell subset-specific differences in transcript abundance. With mRNA-based libraries, cells can be grouped into subsets using immunophenotyping or single-cell RNA-seq to control for these differences. In the case of IG data where primers can be designed to capture the C-regions, each read can be annotated with its isotype using, for example, pRESTO’s MaskPrimers routine. Further, unlike with gDNA, it is straightforward to incorporate unique molecular identifiers (UMIs) at the RNA to cDNA synthesis step. Reads sharing a UMI, which should be unique to an individual original cDNA template, can be processed with pRESTO’s BuildConsensus to generate consensus sequences, which can nearly eliminate sequencing error given sufficient sequencing depth [44, 45]. MiXCR, SONAR, and other packages also offer similar tools. The necessary depth might be difficult to achieve, though, for instance, in cases of vastly different expression levels or with samples of large size.
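The principle of UMI-based consensus building can be sketched with a per-position majority vote. This is not the algorithm implemented by pRESTO, MiXCR, or SONAR; it is a minimal illustration that assumes reads within a UMI group are already in the same orientation and need no alignment, and the function name and `min_reads` default are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def umi_consensus(reads: Iterable[Tuple[str, str]],
                  min_reads: int = 2) -> Dict[str, str]:
    """Build one majority-vote consensus per UMI from (umi, sequence) pairs."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        # Too few reads to outvote an error: skip this UMI group.
        if len(seqs) < min_reads:
            continue
        # Keep only reads of the most common length so columns line up.
        modal_length = Counter(len(s) for s in seqs).most_common(1)[0][0]
        same_length = [s for s in seqs if len(s) == modal_length]
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*same_length)
        )
    return consensus


tagged_reads = [
    ("AAAA", "ACGT"),
    ("AAAA", "ACGT"),
    ("AAAA", "ACCT"),  # one read carries an error at the third position
    ("CCCC", "TTTT"),  # singleton UMI, dropped with min_reads=2
]
consensus_by_umi = umi_consensus(tagged_reads, min_reads=2)
```

The `min_reads` requirement makes the depth trade-off concrete: every UMI group that falls below it is discarded, so lowly expressed templates are lost first when sequencing depth is limited.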

3.1.3 Productive vs. Nonproductive Rearrangements

For each sample, the fraction of productive rearrangements can be an informative metric. On average, it can be expected that approximately 80% of TRB rearrangements and approximately 85% of IGH rearrangements sequenced from mature T or B cells will be productive [46]. Lower frequencies of productive rearrangements can be observed in immature lymphocytes, where selection has not yet been imposed on cells without productive rearrangements [47]. Lower frequencies of productive rearrangements can also be seen in sequencing libraries that are of poor quality. Nonproductive sequences can also be used as a baseline estimator of gene usage frequency in rearrangement [48, 49] and compared to productive sequences to investigate the effects of tolerance checkpoints on the AIRR [50, 51]. With such comparisons, it may be useful to remove clonal lineages that contain both productive and nonproductive versions of the same rearrangement, as sequencing errors can cause a sequence to appear nonproductive. Nonproductive rearrangements are sometimes also useful for identifying clonal expansions in tumors, particularly if tumors harbor SHM that may interfere with primer binding (the nonproductive rearrangements are usually unmutated). Nonproductive rearrangements can be found in lymphocytes that have undergone multiple rounds of V(D)J recombination, as can occur with receptor editing; the presence of more than one rearrangement is particularly common with IG light chains [52, 53]. Finally, for general analyses it is important to computationally filter out nonproductive sequences if one is making claims about selected repertoires.
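The productive fraction of a sample can be estimated with a simple heuristic on the junction nucleotide sequence, sketched below. Treating a junction as productive when it is in frame and free of stop codons is an assumption made for illustration; annotation tools assess the full V(D)J region and report productivity themselves, and their calls should be preferred when available.

```python
from typing import Iterable

STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(junction_nt: str) -> bool:
    """Heuristic: the junction is in frame and contains no stop codon."""
    seq = junction_nt.upper()
    if not seq or len(seq) % 3 != 0:
        return False
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


def productive_fraction(junctions: Iterable[str]) -> float:
    """Fraction of junctions passing the heuristic above."""
    junctions = list(junctions)
    if not junctions:
        raise ValueError("no junctions supplied")
    return sum(is_productive(j) for j in junctions) / len(junctions)


sample_junctions = [
    "TGTGCCAGCAGTTTC",  # in frame, no stop codon
    "TGTGCCAGCAGTTT",   # out of frame
    "TGTGCCTAGAGTTTC",  # in frame, contains TAG
    "TGTGCGAGATGG",     # in frame, no stop codon
]
fraction = productive_fraction(sample_junctions)
```

A fraction well below the expected values quoted above may point to immature cell populations or to a poor-quality library rather than to a biological effect.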

3.2 Gene Annotation

After preprocessing AIRR sequences to retain good-quality and relevant reads, sequences need to be accurately aligned to, and annotated against, an appropriate reference germline database. This process identifies the V, D, and J genes; CDRs; and framework regions (FWRs) for each sequence in the repertoire. There are numerous annotation tools for IG and TR sequences that are freely available to users, including popular programs such as IgBLAST [10] and IMGT/HighV-QUEST (Table 1) [8]. Depending on the tool, different algorithms (e.g., Smith-Waterman) assign the best match among a set of genes in a user-defined reference germline database. Accurate alignment is very important for subsequent analyses such as the identification of SHM for IGs, clustering of clonal groups, and determination of IG/TR diversity. Alignment algorithms have been demonstrated to influence the outcome of V, D, and J gene assignments, even when identical input sequences, tool parameters, and reference germline databases are chosen [31]. Furthermore, differences in the length of alleles of genes in databases may force algorithms to output an incorrect best match in the gene annotation process. To complicate matters, some tools provide alignments to multiple (often highly similar) genes and leave it to users to choose which of the ambiguous calls is most appropriate.
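One conservative way to handle ambiguous calls is to fall back from allele level to gene level when all candidates share a gene, and to flag the call otherwise. The sketch below assumes that multiple calls are reported as a comma-delimited string of IMGT-style names of the form `gene*allele`; both the input format and the function names are assumptions for illustration and should be checked against the output of the tool in use.

```python
from typing import List, Tuple


def split_calls(call_field: str) -> List[str]:
    """Split a comma-delimited call string into individual allele names."""
    return [call.strip() for call in call_field.split(",") if call.strip()]


def gene_of(allele: str) -> str:
    """Drop the allele suffix, e.g. 'IGHV1-69*01' becomes 'IGHV1-69'."""
    return allele.split("*", 1)[0]


def resolve_call(call_field: str) -> Tuple[str, Tuple[str, ...]]:
    """Classify a call as 'allele', 'gene', 'ambiguous', or 'none'."""
    alleles = split_calls(call_field)
    if not alleles:
        return "none", ()
    if len(alleles) == 1:
        return "allele", (alleles[0],)
    genes = sorted({gene_of(allele) for allele in alleles})
    if len(genes) == 1:
        # All candidates are alleles of one gene: report at gene level.
        return "gene", (genes[0],)
    return "ambiguous", tuple(genes)


single = resolve_call("IGHV3-23*01")
same_gene = resolve_call("IGHV1-69*01,IGHV1-69*06")
cross_gene = resolve_call("IGHV1-69*01,IGHV1-69D*01")
```

Whether cross-gene ambiguities should be dropped, kept as a set, or resolved by taking the first call depends on the downstream analysis; the choice should be recorded alongside the other processing steps.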

Schemes for IGs and TRs that number amino acid residues facilitate sequence comparisons, protein structure modelling, and engineering [54]. Although many schemes have been proposed and different schemes are employed by different tools, only five schemes are commonly used. Three are specifically for IGs: Kabat [55], Chothia [56], and enhanced Chothia [57]. Two more can be used for both IGs and TRs: IMGT [58] and AHo [59]. Conversion tables and tools like ANARCI [60] can be used to translate between schemes. CDR boundaries can differ substantially between different numbering schemes: care is needed when comparing results from different studies [54]. In repertoire studies, the IMGT numbering scheme is widely used and supported, and its use is recommended in the absence of other considerations.

One more barrier to direct comparison is that some studies and tools report the “junction” while others report the CDR3. In IMGT terminology, the junction includes the second conserved cysteine of the V gene and the conserved tryptophan or phenylalanine of the J gene, while the CDR3 omits these residues. The AIRR Community data representation standard uses “junction”; however, it is not universally accepted [31].
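Under the IMGT definitions given above, converting between the two representations amounts to removing one anchor residue (or codon) from each end. The following sketch shows the conversion; the function names are hypothetical, and the anchor check is only a sanity test, since noncanonical anchors do occur.

```python
def junction_to_cdr3_nt(junction_nt: str) -> str:
    """Remove the 5' and 3' anchor codons from a junction nucleotide sequence."""
    if len(junction_nt) < 6:
        raise ValueError("junction too short to contain both anchor codons")
    return junction_nt[3:-3]


def junction_to_cdr3_aa(junction_aa: str) -> str:
    """Remove the conserved anchor residues from a junction amino acid sequence."""
    if len(junction_aa) < 2:
        raise ValueError("junction too short to contain both anchor residues")
    return junction_aa[1:-1]


def has_canonical_anchors(junction_aa: str) -> bool:
    """Check for the conserved Cys at the start and Trp or Phe at the end."""
    return (len(junction_aa) >= 2
            and junction_aa[0] == "C"
            and junction_aa[-1] in ("W", "F"))


cdr3_aa = junction_to_cdr3_aa("CASSLGQGYEQYF")
cdr3_nt = junction_to_cdr3_nt("TGTGCCAGCAGTTTC")
```

Applying this conversion consistently before comparing sequences, lengths, or clonotype definitions across studies avoids an off-by-two discrepancy in amino acid length.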

Accurate annotation requires an accurate and comprehensive germline database. As noted above, even the currently available human database does not as yet meet this criterion [15, 61], and databases for other species are often partial and based solely on the analysis of a single animal [36, 62,63,64,65]. Fortunately, scientific need has resulted in the determination of new germline gene sets [36, 40, 66, 67], but these are not necessarily implemented by public germline gene databases in a timely fashion. The impact of missing or incorrect information in the database will depend upon the nature of the analysis, but one overall point to note is that the databases are updated frequently, and changes in the database can impact results [31]. It is therefore important that an analysis is conducted using a single, consistent, and up-to-date version of the database and that the version (or download date) is recorded for reproducibility. Germline databases are sometimes installed automatically with annotation tools: where that is the case, researchers should check if the installed version meets these requirements, and update it if necessary.

In a repertoire from a single individual, although structural variation and gene duplication give rise to frequent exceptions, we would expect to see a maximum of two alleles of most germline receptor genes: one from the paternal and one from the maternal chromosome. When used with an extensive germline database, annotation tools that are based on sequence similarity tend to call a biologically implausible number of alleles in B-cell repertoires, particularly in repertoires that are highly mutated, and will make a large number of indeterminate calls, where the tool is unable to determine the likely germline allele unambiguously. Tools are available that will improve allele calls by using probabilistic methods to infer the individual’s “personalized” germline set; such tools can also infer the presence of alleles in the individual that were not listed in the annotation tool’s germline database [15,16,17, 68, 69]. While the use of a comprehensive germline database is important in the first instance, the determination of a personalized germline set and re-annotation with just that set is recommended where allele assignment is important, for example, when clonal inference is employed. Personalization can also compensate to some extent for deficiencies in the germline database.
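A quick descriptive check for biologically implausible allele counts can be sketched as follows. This is not the probabilistic inference performed by genotyping tools; it only tallies unambiguous allele calls per gene and lists genes with more than two distinct alleles. The comma-delimited, `gene*allele` call format and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def alleles_by_gene(v_calls: Iterable[str]) -> Dict[str, Counter]:
    """Count unambiguous allele calls, grouped by gene."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for call in v_calls:
        # Skip missing calls and calls listing more than one candidate.
        if not call or "," in call:
            continue
        gene = call.split("*", 1)[0]
        counts[gene][call] += 1
    return counts


def genes_exceeding(counts: Dict[str, Counter], max_alleles: int = 2) -> List[str]:
    """Genes with more distinct alleles than expected for a diploid locus."""
    return sorted(gene for gene, alleles in counts.items()
                  if len(alleles) > max_alleles)


calls = [
    "IGHV1-69*01", "IGHV1-69*02", "IGHV1-69*06",
    "IGHV3-23*01", "IGHV3-23*01",
    "IGHV3-23*01,IGHV3-23D*01",  # ambiguous, ignored
    "",                          # missing, ignored
]
allele_counts = alleles_by_gene(calls)
suspicious = genes_exceeding(allele_counts)
```

Genes flagged this way are not necessarily wrong, given the duplications and structural variants mentioned above, but a long list of them in a mutated repertoire suggests that personalized genotype inference and re-annotation would be worthwhile.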

The decision of which annotation tool to use also depends on the computational skill set of the user. IMGT/HighV-QUEST and IgBLAST provide easy-to-use web platforms, suited for researchers who prefer a graphical user interface. Other tools, such as the stand-alone version of IgBLAST [10], MiXCR [1], and partis [11], require additional computer expertise, because they need to be installed and are run from the terminal. The advantage of such tools is that they provide more flexibility and can be integrated in automated workflows.

4 Conclusion

In this chapter, we present important considerations involved in the first steps in the preparation of raw data after sequencing and guide bioinformaticians in choosing the appropriate parameters for preprocessing and annotation. These first steps are required for the subsequent repertoire analysis, described in Chap. 17, and the choices made in them have serious implications for the types of data analyses that can be performed and for the accuracy of the results. Having completed this chapter, the bioinformatician is ready to begin the in-depth analysis of repertoire features specific to the question at hand.