Assembly of an in silico decoy chromosome
We assembled an in silico chromosome that encodes difficult and clinically important regions of the human genome represented by sequins. This artificial decoy chromosome sequence (termed chrQ) is designed to accompany the reference human genome (such as hg38) during indexing and alignment and encodes the genetic features represented by sequins in a single contiguous sequence. The in silico chromosome is approximately 1.7 Mb in length and is organized into four main functional parts, comprising small variants, structural variants, HLA genes, and T and B cell immune receptor genes (Fig. 1a; see Additional files 1 and 2).
The first section of the chromosome encodes a range of synthetic variants, including SNPs and indels (n = 1353) associated with repetitive sequences and GC-rich/poor regions, and clinically relevant microsatellites (n = 12). In addition, the sequins representing germline variation are produced in pairs (reference and variant; n = 24 pairs) to emulate the diploid alleles of the human genome. This enables the evaluation of phasing methods to correctly resolve broad haplotype blocks sampled from each human chromosome (chr1-22, X and Y; average length = 5.9 kb). The second section encodes a range of large structural variants, including deletions, insertions (including mobile element insertions), duplications, inversions, and translocations (n = 45). The third section encodes a range of alternative HLA alleles (n = 8), while the fourth and final section encodes synthetic T and B cell receptor loci that have undergone V(D)J recombination (n = 20). Together, the in silico chromosome serves as a ground-truth reference sequence that encodes a wide range of difficult and clinically important features selected from the human genome (see Additional file 3: Table S1).
Genetic reference and variant standards represented within the in silico chromosome were first synthesized as DNA fragments (average length = 2 kb) by a commercial vendor and validated using Sanger sequencing (see Additional file 4). These sequins were then mixed at different concentrations to emulate different allele frequencies, including germline homozygous and heterozygous genotypes, as well as somatic allele frequencies (Fig. 1b; see Additional file 5). This final mixture was then sequenced alone or added at a fractional concentration (~ 2%) to reference human genomic DNA (NA12878) and sequenced using a range of library preparation methods and sequencing technologies (Fig. 1b).
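As an illustration of the mixture design, the short sketch below converts a target allele fraction into the mass split across a reference/variant sequin pair and estimates the spike-in mass required for a ~ 2% fractional concentration. All masses, values, and function names are hypothetical and do not reproduce the laboratory protocol described in the “Methods” section.

```python
# Illustrative mixture arithmetic (hypothetical masses; not the actual
# laboratory protocol).

def sequin_mix(target_vaf: float, pair_mass_ng: float) -> tuple[float, float]:
    """Split the total mass of a sequin pair between the variant and the
    reference allele so that the variant is sampled at ~target_vaf.
    Assumes both sequins have the same length and molar yield."""
    variant_ng = pair_mass_ng * target_vaf
    reference_ng = pair_mass_ng * (1.0 - target_vaf)
    return variant_ng, reference_ng

def spikein_mass_ng(genomic_dna_ng: float, fraction: float = 0.02) -> float:
    """Mass of pooled sequins to add so they make up ~`fraction` (e.g. ~2%)
    of the combined library input."""
    return genomic_dna_ng * fraction / (1.0 - fraction)

# Example: a heterozygous (0.5) and a low-frequency somatic (0.05) pair,
# spiked into 1000 ng of NA12878 genomic DNA at ~2%.
print(sequin_mix(0.5, 10.0))                 # ~ (5.0, 5.0) ng
print(sequin_mix(0.05, 10.0))                # ~ (0.5, 9.5) ng
print(round(spikein_mass_ng(1000.0), 1))     # ~ 20.4 ng
```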
Study design for evaluating genome technologies
Sequins provide a universal reference material to benchmark the performance of different genome technologies. We identified key variables that are known to impact performance in sequencing experiments, including base-calling accuracy, read length, and the use of PCR amplification during library preparation. We then designed experiments based on alternative preparation methods and sequencing instruments to include these key variables and evaluate the use of the in silico chromosome in diverse experimental settings. For example, we selected library preparation methods that differ in their fragmentation strategy and use of PCR amplification, which can add further errors and bias during library preparation (LSK110 kit, KAPA HyperPlus PCR-based/PCR-free kits, and MGIEasy PCR-free; see the “Methods” section; Fig. 1a). We also considered different sequencing instruments that vary in terms of cost, read length, error rate, and throughput, such as short-read (including Illumina HiSeq 2500™, HiSeq X Ten™, NextSeq™, and BGI MGISEQ-2000™) and long-read (Oxford Nanopore Technologies PromethION™) sequencing technologies (Fig. 1b). We also prepared the standards mixture neat, without any accompanying genomic DNA, using the same preparation kit (KAPA HyperPlus PCR-based) but sequenced on different instruments (HiSeq 2500 and NextSeq) to evaluate any instrument-specific biases (2 replicates each). Following sequencing, reads were aligned to the combined reference genome (comprising both hg38 and chrQ; see the “Methods” section), and we then employed a range of bioinformatic tools to evaluate the alternative analytical strategies that are used to resolve difficult regions of the in silico chromosome (Fig. 1a; see Additional file 6: Table S2).
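The following minimal sketch illustrates this alignment set-up (appending chrQ to hg38, aligning, and extracting chrQ alignments for benchmarking), assuming BWA-MEM for a short-read library and hypothetical file paths; it is not the exact pipeline used in this study.

```python
# Sketch of the alignment set-up: append the decoy chromosome to hg38,
# index, align, and extract chrQ alignments for benchmarking.
# File names, paths, and thread counts are hypothetical.
import shutil
import subprocess
import pysam

# 1. Build and index a combined reference (hg38 + chrQ).
with open("hg38_chrQ.fa", "wb") as out:
    for fasta in ("hg38.fa", "chrQ.fa"):
        with open(fasta, "rb") as fh:
            shutil.copyfileobj(fh, out)
subprocess.run(["bwa", "index", "hg38_chrQ.fa"], check=True)

# 2. Align a paired-end short-read library against the combined reference
#    (a long-read library would use minimap2 instead of bwa mem).
with open("sample.sam", "w") as sam:
    subprocess.run(["bwa", "mem", "-t", "8", "hg38_chrQ.fa",
                    "reads_R1.fastq.gz", "reads_R2.fastq.gz"],
                   stdout=sam, check=True)
pysam.sort("-o", "sample.bam", "sample.sam")
pysam.index("sample.bam")

# 3. Report the fraction of mapped reads assigned to the decoy chromosome.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    chrq_reads = bam.count(contig="chrQ")
    print(f"chrQ fraction: {chrq_reads / bam.mapped:.3%}")
```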
Comparison between NA12878 genome and corresponding sequins
To initially validate the sequins, we first compared their sequencing performance to high-confidence regions and variants within the accompanying NA12878 genome sample. We showed that alignment coverage and distribution matched closely between NA12878 and the accompanying standards (RMSE; Illumina = 0.24, MGI = 0.30, ONT = 0.18; see Fig. S1a). We next found that the sequencing mismatch error was also similar between sequins and corresponding human genome regions (RMSE; Illumina = 0.47, MGI = 0.48, ONT = 0.24; see Fig. S1b). The standards also reproduced errors and biases observed at more complex variants, such as large deletions that have been characterized with high confidence for NA12878 (see Fig. S2a). The commutability of sequins with NA12878 supports their use in characterizing sequencing performance in low-confidence regions and complex variants.
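A minimal sketch of this commutability comparison is shown below: the RMSE is computed between mean-normalized per-base coverage profiles of a sequin and its matched NA12878 region. The depth vectors are toy values (e.g., as could be extracted with samtools depth) and are illustrative only.

```python
# Sketch of the commutability comparison: RMSE between mean-normalized
# per-base coverage profiles of a sequin and its matched NA12878 region.
# The depth vectors below are toy values.
import numpy as np

def normalized(depth: np.ndarray) -> np.ndarray:
    """Scale a per-base depth vector by its mean coverage."""
    return depth / depth.mean()

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-square error between two normalized coverage profiles."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

sequin_depth = np.array([30, 32, 29, 31, 33], dtype=float)
na12878_depth = np.array([28, 30, 27, 35, 31], dtype=float)
print(round(rmse(normalized(sequin_depth), normalized(na12878_depth)), 3))
```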
While the NA12878 genome and sequins exhibited similar performance in high-confidence regions and for simple variants, we found that the error profiles were different at genomic positions where NA12878 diverged from the reference genome. For example, within Illumina HiSeq libraries, the error frequencies at those divergent positions were higher in NA12878 alignments than in corresponding sequins (single base mismatches: 3.1% in NA12878, 0.6% in sequins; insertions: 10.8% in NA12878, 0.3% in sequins; and deletions: 17.1% in NA12878, 0.3% in sequins). While sequins provide an unambiguous measurement of error at difficult sites, such as microsatellites or simple repeats, the measurement of error using the NA12878 genome is confounded by the presence of bona fide variants that cannot be reliably distinguished from sequencing errors (see Fig. S2b). This illustrates the value of sequins in providing an unambiguous representation of difficult regions. Accordingly, within the following sections, we use sequins to provide a detailed understanding of sequencing performance for these difficult regions and complex variants.
Sequencing errors at difficult or repetitive chromosomal regions
The depth and uniformity of alignment fold coverage are key variables in the detection of genetic variants. To first compare the alignment coverage of each library, we measured per-base normalized coverage across the in silico chromosome (Fig. 2a). We found that PCR-free library preparation (IQR = 0.35) and long-read sequencing (IQR = 0.30) strategies achieved the most homogeneous coverage, as indicated by their lower interquartile range (IQR), while short-read PCR-based libraries exhibited more heterogeneous coverage (IQR; MGISEQ-2000 = 0.38, HiSeq 2500 = 0.36, NextSeq 500 = 0.46; Fig. 2a).
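The coverage-uniformity metric can be summarized as in the following sketch, which normalizes per-base depth across chrQ by its mean and reports the IQR; the input is assumed to be a samtools depth-style file with a hypothetical path.

```python
# Sketch of the coverage-uniformity metric: normalize per-base depth across
# chrQ by its mean and report the interquartile range (IQR).
# Assumes a `samtools depth`-style input (chrom, pos, depth); path is hypothetical.
import numpy as np

depths = []
with open("sample.chrQ.depth.txt") as fh:
    for line in fh:
        chrom, _pos, depth = line.split()
        if chrom == "chrQ":
            depths.append(int(depth))

norm = np.asarray(depths, dtype=float)
norm /= norm.mean()
q1, q3 = np.percentile(norm, [25, 75])
print(f"normalized coverage IQR = {q3 - q1:.2f}")
```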
To identify the source of this variability in alignment coverage, we undertook a closer analysis of alignments at difficult regions of low (< 30%) or high (> 65%) GC content (Fig. 2b; see Fig. S1c). PCR-free library preparation and long-read sequencing, which achieved globally homogeneous coverage, were largely unaffected by GC-rich/poor regions. However, among the other technologies, there was a reduction in coverage at low GC regions for MGISEQ-2000 (43.84% relative to mean global coverage) and HiSeq 2500 (51.49%) (Fig. 2b; see Fig. S1c). At high GC regions, the PCR-free and HiSeq 2500 libraries exhibited an increase in coverage with GC content (31.96% and 52.67%, respectively), while the MGISEQ-2000 and NextSeq 500 libraries exhibited a reduction in coverage as the GC content increased (32.08% and 28.14%, respectively; Fig. 2b; see Fig. S1c). These same libraries also showed a reduction in alignment coverage at simple repeats; however, this was less pronounced than at GC-rich/poor regions (Fig. 2c).
We next used the in silico chromosome as a ground-truth reference to measure the sequencing errors of each technology. As expected, ONT long-read sequencing suffered from a substantially higher error rate than other technologies (0.030 mismatches per base), and among short-read instruments, the HiSeq 2500 achieved the most accurate reads (0.0018 mismatches per base), compared to MGISEQ-2000 (0.0084 mismatches per base; Fig. 2a). The relative frequency of transition and transversion errors also varied between instruments. For example, transition errors were higher for ONT (63.4%) and lower for NextSeq 500 (14.4%) compared to other libraries (overall mean; transitions = 32.9% ± 15.5; Fig. 2a). Accordingly, we generated detailed sequencing error profiles for different technologies that can provide a background against which to correct mutational signatures, especially for low-frequency somatic variants (see Fig. S1d).
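A minimal sketch of how such an error profile can be summarized is shown below: mismatches against the chrQ ground truth are classified as transitions or transversions and expressed per aligned base. The mismatch tuples and base count are hypothetical toy inputs.

```python
# Sketch of the error-profile summary: classify mismatches against the chrQ
# ground truth as transitions or transversions and express the mismatch rate
# per sequenced base. The mismatch tuples and base count are hypothetical.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """A transition exchanges purine<->purine or pyrimidine<->pyrimidine."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def error_profile(mismatches, aligned_bases):
    ts = sum(is_transition(ref, alt) for ref, alt in mismatches)
    total = len(mismatches)
    return {
        "mismatches_per_base": total / aligned_bases,
        "transition_fraction": ts / total if total else 0.0,
        "transversion_fraction": (total - ts) / total if total else 0.0,
    }

# Toy example: 4 mismatches observed over 1,000 aligned bases.
print(error_profile([("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")], 1_000))
```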
We next considered the impact of sequencing errors and coverage on the detection of somatic variants. For each library, we evaluated the frequency of erroneous false-positive variants that otherwise impose a lower limit on the accurate detection of low-frequency mutations. Among short-read libraries, PCR-free library preparation achieved significantly lower false discovery rates (AUC = 0.0035) than corresponding PCR-based preparations (AUC; HiSeq 2500 = 0.035, NextSeq 500 = 0.039) or MGISEQ-2000 (AUC = 0.02). In contrast, the lower sequencing accuracy of long-read sequencing resulted in higher false discovery rates for somatic variants (AUC = 0.12). These results indicate how variation in key variables, such as coverage and sequencing error, across library preparation methods and sequencing instruments can limit the detection of clinically important features, such as somatic mutations, at difficult genomic regions (Fig. 2d).
Resolution of genetic variation at low-complexity regions, including microsatellites
DNA replication of simple repeat sequences is difficult, resulting in the accumulation of mutations at these sites, which are consequently highly polymorphic in human populations [18,19,20]. These simple repeats are also a challenge to sequence accurately, as technical sequencing errors can be difficult to distinguish from bona fide genetic variants (Fig. 3a). Therefore, we next evaluated the detection of insertion and deletion errors at small (≤ 5 nt), medium (6–15 nt), and large (> 15 nt) homopolymer sites in the in silico chromosome across different genome technologies (see Fig. S3a).
To evaluate sequencing performance at repeats, we first compared the fraction of reads with correct or erroneous repeat length within each library. We found that erroneous deletions (HiSeq X Ten/PCR-free = 3.14 × 10−5; MGISEQ-2000 = 6.98 × 10−5; ONT PromethION = 5.5 × 10−2; Fig. 3b) are more common at homopolymer sites than insertions (HiSeq X Ten/PCR-free = 6.29 × 10−6; MGISEQ-2000 = 2.13 × 10−5; ONT PromethION = 1.1 × 10−2; Fig. 3b). Furthermore, the frequency of both error types increases with homopolymer length up to ~ 15 nt, beyond which the error rates remain constant for larger repeats (mean Pearson’s correlation; deletions = 0.85 and insertions = 0.40; see Figs. S3b, c). However, for ONT PromethION, the rate of insertions decreases with homopolymer length (Pearson’s correlation = −0.97; see Fig. S3c). Notably, we also observed substantial performance differences due to the library preparation method and sequencing technology (Kruskal-Wallis test; H(7) = 94.54, p-value ≤ 0.0001, N = 31, Fig. 3b).
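To illustrate how homopolymer accuracy can be scored, the sketch below measures the run length observed in each read spanning a known chrQ homopolymer flanked by unique anchor sequences and tallies matches, insertions, and deletions relative to the expected length. The anchors, reads, and expected length are hypothetical, and this regex-based approach is a simplification of an alignment-based analysis.

```python
# Sketch of homopolymer-length scoring: measure the run length between two
# flanking anchor sequences in each spanning read and classify it against the
# expected length. All sequences and lengths are hypothetical.
import re
from collections import Counter

def observed_run_lengths(reads, left_anchor, right_anchor, base):
    """Homopolymer length seen between the two anchors in each read that
    contains both anchors."""
    pattern = re.compile(
        re.escape(left_anchor) + f"({base}*)" + re.escape(right_anchor))
    lengths = []
    for read in reads:
        match = pattern.search(read)
        if match:
            lengths.append(len(match.group(1)))
    return lengths

reads = ["TTGACAAAAAAACGTCA",    # 7 A's: correct length
         "TTGACAAAAAACGTCA",     # 6 A's: one deleted base
         "TTGACAAAAAAAACGTCA"]   # 8 A's: one inserted base
lengths = observed_run_lengths(reads, "TTGAC", "CGTCA", "A")
expected = 7
calls = Counter("match" if n == expected else
                "deletion" if n < expected else "insertion"
                for n in lengths)
print(calls)  # Counter({'match': 1, 'deletion': 1, 'insertion': 1})
```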
The difficulty in sequencing homopolymers with ONT PromethION is well-established, and only a minority (~ 6%) of sequenced long reads exhibited the correct homopolymer length (see Figs. S3d, e). In contrast, among short-read libraries, PCR-free preparation significantly reduces erroneous deletion rates across all homopolymer lengths and insertion rates at small homopolymers (Fig. 3b; see Fig. S3f). Reads from PCR-free libraries also exhibit a higher proportion of exact matches to the expected homopolymer length (~ 77.6%) compared to the other libraries (61.7%), with MGISEQ-2000 exhibiting deletion rates comparable to PCR-free libraries at small homopolymers (Fig. 3b; see Figs. S3e, f).
Microsatellites are highly polymorphic short repeat sequences that are interspersed throughout the human genome and are commonly used as markers in forensics and genealogy, as well as for the detection of deficient DNA mismatch repair in human diseases [21]. We designed 7 stable synthetic microsatellites (NR27, NR24, NR22, NR21, MONO27, D18S55, and CAT25) and 5 unstable microsatellites (BAT-25, BAT-26, and the three dinucleotide loci D2S123, D5S346, and D17S250) from the Bethesda panel (Fig. 3a) [17]. At stable microsatellites, reads should exactly match the expected microsatellite length, while reads at unstable microsatellites should vary from the expected microsatellite length (Fig. 3a; see Fig. S3g). Again, PCR-free preparation achieved the best accuracy for most stable microsatellites (82.0%); however, performance varied across the instruments (HiSeq 2500 = 60.0%, NextSeq 500 = 45.0%, MGISEQ-2000 = 59.0%), with each technology exhibiting distinct biases. Finally, ONT long reads were unable to accurately resolve almost any microsatellites (6%; Fig. 3c).
In summary, we found that ONT is not suitable for the analysis of simple repeats due to high error rates, and, among short-read libraries, PCR amplification substantially reduced accuracy. The use of rolling circle amplification (within MGISEQ-2000 preparation), which employs the original copy of the DNA as a template during each amplification round, exhibits better performance at small homopolymers, but remains susceptible to insertion/deletion errors at larger repeats, such as microsatellites. Overall, the exclusion of amplification in PCR-free preparation methods achieved the best performance and is likely required for the accurate detection of microsatellite instability.
Resolution of synthetic structural variants with next-generation sequencing
Structural variants (SVs) involve the rearrangement of large chromosomal regions and can be difficult to resolve using next-generation sequencing, with the annotation of current genome references largely restricted to insertions and deletions in high-confidence regions of the human genome [22]. Therefore, we designed a set of sequins that represented insertions (n = 6) and deletions (n = 10), but also inversions (n = 10), duplications (n = 11), viral insertions (n = 9), and reciprocal translocations (n = 8) that can benchmark the precision of structural variant detection (see Additional file 8: Table S3). To evaluate the detection of synthetic SVs, we used different software for short-read (Lumpy [23] and Manta [24]) and long-read libraries (CuteSV [25] and Sniffles [1]). Performance was evaluated according to the correct identification of the SV and the accuracy of breakpoint detection.
We first measured the performance across library preparation/sequencing technologies by aggregating the results from different structural variant callers (see Fig. S4a). We found that the depth of coverage impacted sensitivity, with short-read libraries achieving better performance than long-read libraries when considering all the different SV types (see Fig. S4b). Similarly, among PCR-based libraries, we observed a difference between instruments, with HiSeq 2500 performing better than NextSeq 500 at higher coverage (two-sample Wilcoxon test; p-value ≤ 0.01; see Fig. S4a), while both performed equally poorly at lower coverage (see Fig. S4b).
We next evaluated the breakpoint detection achieved by the different SV software tools. For long-read ONT sequencing, which is able to align across large variants, CuteSV and Sniffles achieved similar overall precision (AUC; CuteSV = 0.64, Sniffles = 0.68; see Fig. S4e); however, Sniffles had better overall sensitivity (AUC; CuteSV = 0.31, Sniffles = 0.39; see Fig. S4d). The precision of breakpoint detection was high across all library preparation/sequencing technologies, with an average of 97.92% for short-read libraries, while ONT long-read sequencing correctly detected most breakpoints (86.77%) within 5 nt of the original position (see Fig. S4c).
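A minimal sketch of this breakpoint-accuracy scoring is shown below: a called breakpoint is counted as correct if it falls within a fixed tolerance (here 5 nt) of a ground-truth chrQ breakpoint. The input coordinates are hypothetical and would normally be parsed from the caller and truth VCFs.

```python
# Sketch of breakpoint-accuracy scoring: a called breakpoint is correct if it
# lies within `tolerance` nt of any ground-truth chrQ breakpoint on the same
# contig. Input coordinates are hypothetical.
def breakpoint_precision(called, truth, tolerance=5):
    """Fraction of called breakpoints within `tolerance` nt of the truth set."""
    hits = 0
    for chrom, pos in called:
        if any(t_chrom == chrom and abs(t_pos - pos) <= tolerance
               for t_chrom, t_pos in truth):
            hits += 1
    return hits / len(called) if called else 0.0

truth = [("chrQ", 120_500), ("chrQ", 340_210)]
called = [("chrQ", 120_498), ("chrQ", 340_230), ("chrQ", 900_001)]
print(breakpoint_precision(called, truth))  # 0.333... (1 of 3 within 5 nt)
```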
We also used the synthetic structural variants to evaluate popular bioinformatic tools that identify SVs from short-read libraries. We assessed the sensitivity of these tools at varying alignment fold coverage, finding that Lumpy and Manta achieved similar sensitivities (relative to fold coverage) across the libraries (AUC; Lumpy = 0.52, Manta = 0.51); however, Manta exhibited greater precision (AUC; Lumpy = 0.84, Manta = 0.93). Both tools use split-read and discordant read-pair evidence; however, while Lumpy also incorporates read depth into a probabilistic framework [23], Manta first assembles a graph of all break-end associations [24]. A direct comparison of the supporting evidence for individual SV calls showed that Manta recovered a greater number of split reads and discordant read pairs, which may account for its higher observed precision (see Figs. S5a, b).
We next investigated the ability to detect different structural variant types. Deletions, inversions, and reciprocal translocations were widely detected by the different libraries and bioinformatic tools (mean sensitivity; DEL = 0.67 ± 0.24, INV = 0.67 ± 0.26, TRA = 0.57 ± 0.16; Fig. 4b; see Fig. S5c). Notably, deletions and inversions were better detected among short-read libraries, while ONT long reads achieved better sensitivity at detecting translocations (Fig. 4a; see Fig. S5d). In contrast, duplications and insertions were more challenging to detect, with long reads performing slightly better, especially as the depth of coverage decreased (mean sensitivity; DUP = 0.35 ± 0.21, INS = 0.22 ± 0.26; see Fig. S5c). Overall, insertion detection was poor among PCR-based methods, while duplication detection had low sensitivity, particularly in NextSeq 500 libraries (Fig. 4b). Notably, performance also varied according to SV length, with ONT long-read sequencing failing to detect longer insertions that were otherwise detected within short-read libraries (see Figs. S5e, f).
Together, these results highlight the difficulty associated with SV calling, and the pervasive impact of library preparation, sequencing technology, and software on analysis. Among short-read libraries, PCR-based methods impair SV detection, while long-read sequencing can provide alignments that span large chromosomal rearrangements, and thereby resolve complex structural variants. However, all methods exhibit variable performance across the diversity of SV types. As a result, no single approach achieved comprehensive SV detection, and instead, a combination of approaches was required to identify the range of ground-truth synthetic SVs.
Phasing genetic variants into haplotype blocks
The phasing of alleles enables genetic variants to be linked to paternal or maternal human chromosomes [26]. However, phasing can be difficult at regions with sparse variants and is limited by fragment size and read length. To evaluate the phasing accuracy achieved by different library preparation and sequencing technologies, we designed 22 pairs of sequins that each represent maternal and paternal alleles for large regions (~ 6 kb) of each human autosome, as well as chromosomes X and Y. Each control pair includes allele-specific common genetic variation and forms a diploid representation of human chromosomes (see Fig. S6a). To phase these synthetic alleles within NGS libraries, we used WhatsHap [27] and evaluated performance according to the fraction of correctly phased variants and the length of correctly resolved haplotype blocks.
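The block-length evaluation can be illustrated with the following sketch, which groups phased heterozygous variants by their phase-set (PS) tag in a WhatsHap-phased VCF and reports the span of each haplotype block; the file path is hypothetical and a single sample is assumed.

```python
# Sketch of the haplotype-block summary: group phased heterozygous variants by
# their phase-set (PS) tag in a WhatsHap-phased VCF and report each block's span.
# The VCF path is hypothetical; a single-sample VCF is assumed.
from collections import defaultdict

blocks = defaultdict(list)  # (chrom, phase set) -> list of variant positions
with open("sequins.phased.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, fmt, sample = fields[0], int(fields[1]), fields[8], fields[9]
        values = dict(zip(fmt.split(":"), sample.split(":")))
        alleles = values.get("GT", ".").split("|")
        # Keep only phased heterozygous calls (e.g. 0|1 or 1|0) with a PS tag.
        if len(alleles) == 2 and alleles[0] != alleles[1] and "PS" in values:
            blocks[(chrom, values["PS"])].append(pos)

for (chrom, ps), positions in sorted(blocks.items()):
    span = max(positions) - min(positions) + 1
    print(f"{chrom}\tblock {ps}\t{len(positions)} variants\tspan {span} nt")
```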
The initial inspection of read alignments revealed clear differences in phased haplotype blocks between short- and long-read libraries. For example, phasing synthetic heterozygous variants on chromosome 20 revealed progressively shorter haplotype blocks for ONT, HiSeq X Ten/PCR-free, MGISEQ-2000, and HiSeq 2500/PCR-based (Fig. 5a). Indeed, ONT achieved significantly longer blocks overall compared to all other technologies (see Fig. S6b). The average read length for ONT was 755.3 nt (SD = 831.6; see Fig. S6c), which was limited by the length of the sequins (~ 2 kb on average; see the “Methods” section), and long-read technology was capable of consistently phasing distant variants (> 1000 nt apart; see Fig. S6d) that could not otherwise be phased with short-read libraries (Fig. 5c). These longer haplotypes generated by long-read ONT sequencing exhibited slightly lower sensitivity (long-read = 0.93, short-read = 0.98 ± 0.01), but also a lower proportion of false-positive variants compared to short-read PCR-based methods (long-read = 3.03%, average short-read/PCR-based = 7.67%; see Fig. S6e).
Within short-read libraries, PCR-free preparation achieved longer haplotype blocks than alternative PCR-based methods (pairwise Mann-Whitney-Wilcoxon test; p-value ≤ 0.001; Fig. 5b; see Fig. S6b). Given that phasing accuracy is a function of the pairwise distance between variants, we found that this advantage was most apparent in non-polymorphic genome regions, where variants are sparsely distributed (Fig. 5c; see Figs. S6f, g). This advantage was supported by the distribution of DNA insert sizes, which showed that PCR-based libraries had smaller DNA fragments (HiSeq 2500 = 168.83 nt, NextSeq 500 = 195.16 nt) than PCR-free or MGISEQ-2000 libraries (HiSeq X Ten = 319.82 nt, MGISEQ-2000 = 289.63 nt; Fig. 5d). Indeed, while the DNA fragment size distributions of these approaches had similar medians (HiSeq X Ten = 308 nt, MGISEQ-2000 = 287 nt), there was a subset of longer fragments (19.79%) in the PCR-free library which enabled phasing of more distant variants (Fig. 5a, d).
Together, this analysis illustrates the importance of longer DNA fragment size and read length, as well as variant density, in achieving successful phasing. Furthermore, the gap between the shorter haplotypes with more accurate variant detection provided by short-read sequencing and the longer but less accurate haplotypes provided by long-read ONT sequencing continues to close. Optimally, a combination of these two sequencing technologies should produce longer and more accurate phased haplotypes.
Impact of sequencing accuracy and coverage on HLA typing
The recognition of non-self-antigens by the immune system is mediated through the major histocompatibility complex (MHC) which is encoded within a 3.6-Mb region on chromosome 6. Due to selective pressures, this is one of the most polymorphic loci in the human genome, and variation of the human leukocyte antigen (HLA) genes is associated with disease [28, 29]. The accurate and rapid resolution of HLA genes is also required for successful donor-patient matching in organ transplantation. However, due to the complexity and hypervariability of this region, the accurate typing of HLA genes remains difficult with NGS [30].
To evaluate the use of NGS to perform accurate HLA typing, we incorporated a synthetic MHC region within the in silico chromosome that was accompanied by sequins representing HLA-A, HLA-B, HLA-C, and HLA-DQB1 alleles (Fig. 6a). We first inspected alignment accuracy at the reference HLA genes on the in silico chromosome (Fig. 6a). We found that short-read libraries closely matched the sequins, while ONT long-read sequencing, which exhibits an elevated sequencing error rate (mean 5-fold; see Fig. S7a), performed comparably to other technologies at the consensus level, with no errors observed within exons 2 and 3 of HLA-C and HLA-B (see Fig. S7b).
We then focused on typing the G-group exons (2 and 3) using HLA-LA [31] at varying fold coverage. At the antigen level (where different alleles expressing the same epitopes are grouped together), all libraries achieved accurate typing of the HLA alleles (see Figs. S7c, d). However, at the allele-group resolution level, which provides a more specific standard for evaluating HLA typing, we observed variation in library performance that was largely dependent on coverage depth (see Fig. S7e). At 25-fold coverage or higher, HLA-typing accuracy was comparable among all short-read libraries, while PCR-free and NextSeq 500 libraries achieved the best sensitivity at lower coverage (down to 5-fold) due to increased coverage at the target exons used for HLA typing (see Figs. S7c, e).
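The coverage titration can be illustrated with the sketch below, which subsamples a library to a series of target depths with samtools view -s before re-running the typing tool; the file paths, baseline depth, and random seed are hypothetical.

```python
# Sketch of a coverage titration: subsample a BAM to target mean depths with
# `samtools view -s` before re-running HLA typing. Paths, baseline depth, and
# seed are hypothetical.
import subprocess

def downsample(bam_in: str, bam_out: str, baseline_depth: float,
               target_depth: float, seed: int = 11) -> None:
    """Randomly subsample a BAM so its mean depth is ~target_depth."""
    fraction = target_depth / baseline_depth
    if fraction >= 1.0:
        raise ValueError("target depth must be below the baseline depth")
    # samtools view -s takes INT.FRAC, where INT seeds the RNG and FRAC is the
    # fraction of reads to keep (e.g. 11.2500 keeps 25% with seed 11).
    frac_str = f"{fraction:.4f}".lstrip("0")          # 0.2500 -> ".2500"
    subprocess.run(["samtools", "view", "-b", "-s", f"{seed}{frac_str}",
                    "-o", bam_out, bam_in], check=True)
    subprocess.run(["samtools", "index", bam_out], check=True)

# Example: titrate a 100-fold library down to 25-, 10-, and 5-fold coverage
# before re-running the HLA typing tool on each subsampled BAM.
for depth in (25, 10, 5):
    downsample("sample.bam", f"sample.{depth}x.bam",
               baseline_depth=100.0, target_depth=float(depth))
```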
ONT provides real-time sequencing that can rapidly match donor transplants to a recipient host during surgery and has accordingly been of considerable interest for HLA typing [32]. We found that at high-depth coverage, ONT genotyping performs comparably to short-read sequencing approaches. However, at coverages lower than 25-fold, ONT consistently misassigned allele groups, as the lower base-calling accuracy of ONT confounds the accurate determination of genotypes (see Fig. S7c).
Clonotype repertoire analysis of synthetic immune receptors
T cell receptors (TCR) and B cell receptors/immunoglobulins (BCR/Ig) recognize diverse foreign antigens as part of an effective adaptive immune response. These TCR and BCR genes undergo stochastic V(D)J recombination to generate a massive combinatorial diversity of receptor sequences that is further increased by random nucleotide excision and addition at rearranged junctions [33]. NGS is being increasingly used to profile this repertoire of TCR and BCR clonotypes and measure T- and B cell dynamics in healthy individuals, patients with cancer, infections, and autoimmune diseases. However, these diverse applications have different requirements for sensitivity, specificity, and quantitative accuracy that are impacted by sequencing and analytical errors [34, 35].
To evaluate the genome DNA-based profiling of TCR and BCR genes using NGS, we developed a synthetic immune repertoire containing both non-rearranged and rearranged IGL, IGK, TRG, TRD, and TRB clonotypes (Fig. 6b). These synthetic clonotypes represent TCR and BCR sequences that were derived from patient samples [36]. To further evaluate the quantitative accuracy of profiling techniques, we also mixed the sequins at different concentrations to form a quantitative ladder that encompasses different clonotype frequencies.
We first measured the sequencing accuracy of CDR3 sequences from rearranged TCR and BCR standards. Among short-read libraries, the HiSeq 2500 (0.28 errors/100 nt) and NextSeq 500 (0.41 errors/100 nt) accumulated fewer errors at CDR3 regions than other instruments (HiSeq X Ten = 0.59 errors/100 nt, MGISEQ-2000 = 0.98 errors/100 nt; see Fig. S8a), while, as expected, the greater error rate of ONT long-read sequencing (9.06 errors/100 nt) precluded the unambiguous resolution of CDR3 sequences.
We found that the PCR-free method demonstrated the best quantification of clonotype abundances (RMSE = 0.09; R2 = 0.95), while NextSeq 500 (RMSE = 0.15; R2 = 0.64) and ONT (RMSE = 0.14; R2 = 0.8) measured the clonotype frequencies poorly, especially at higher fractions (Fig. 6c). Notably, the ability to accurately resolve CDR3 sequences was directly correlated with clonotype frequency, with low-frequency clonotypes proportionately accumulating more errors (see Fig. S8a).
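A minimal sketch of this quantitative-ladder evaluation is shown below: observed clonotype frequencies are compared with the expected ladder frequencies on a log scale, and RMSE and R² are reported. The toy frequencies are hypothetical and do not correspond to the actual ladder design.

```python
# Sketch of the quantitative-ladder evaluation: RMSE and R^2 between expected
# and observed clonotype frequencies on a log10 scale. Values are hypothetical.
import numpy as np

def ladder_metrics(expected, observed):
    """RMSE and R^2 between log10 expected and log10 observed frequencies."""
    x = np.log10(np.asarray(expected, dtype=float))
    y = np.log10(np.asarray(observed, dtype=float))
    rmse = float(np.sqrt(np.mean((y - x) ** 2)))
    fit = np.poly1d(np.polyfit(x, y, 1))              # linear fit on log scale
    ss_res = float(np.sum((y - fit(x)) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return rmse, 1.0 - ss_res / ss_tot

expected = [0.40, 0.20, 0.10, 0.05, 0.025]            # designed ladder fractions
observed = [0.37, 0.22, 0.09, 0.06, 0.020]            # measured frequencies
rmse, r2 = ladder_metrics(expected, observed)
print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```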
We next evaluated the accuracy by which rearranged TCR and BCR clonotypes were resolved using MiXCR [37]. Among short-read libraries, we observed high sensitivity for detecting clonotypes (varying between 78.9 and 89.4%), with most missed clonotypes having low expected frequency (see Fig. S8b). While CDR3 sequences were correctly identified, some recombined V(D)J segments, such as TRBV12-4/TRBD2/TRBJ2-2, were erroneously classified as TRBV12-4/TRBD1/TRBJ2-2 across all short-read libraries, indicating the presence of systematic false-positive artifacts. Nevertheless, despite these errors, we generally observed high precision (varying between 85.0 and 94.4%), with most false-positive segments resulting from clonotypes that were identified at low frequency (mean false-positive frequency of 0.0038; see Fig. S8c).
While the sequins do not encompass the scope and complexity of natural immune repertoires, they do provide a ground-truth standardized reference that can identify systematic biases, even when shared by all technologies. This construction of a synthetic repertoire provides a useful reference for the standardization of immune repertoire profiling by the research community [38, 39].