Introduction

A hallmark of adaptive immunity is the generation of a potent and specific response to a diverse array of pathogens, facilitated by immune molecules called immunoglobulins (Igs). When bound to an antigen, Igs activate the immune system’s downstream responses, leading to neutralization and removal of the target1. Igs are classified into five isotypes (IgA, IgD, IgE, IgG, and IgM), each with distinct characteristics and functions2. The structure of Igs comprises the heavy chain (VH) and the light chain (VL), with each chain composed of multiple gene segments spliced together in a process called V(D)J recombination2,3. This stochastic process creates an enormous theoretical diversity of unique recombined Ig sequences, estimated to be at least 10¹³ in humans4,5. The collection of all unique Igs within an individual is referred to as the Ig repertoire.

To investigate the genomic diversity of the human Ig repertoire, high-throughput B-cell receptor (BCR) sequencing (BCR-seq) has now become the standard method4,6,7,8. The use of BCR repertoire profiling has helped researchers gain new insights into the nature of immune protection, with important implications for the understanding of health and disease9,10,11. BCR repertoire profiling usually entails the quantification of repertoire features such as clonal distribution and expansion, germline gene usage, and clonal sequence overlap between repertoires4,9,12,13,14. Alterations of repertoire features have been detected in infectious and autoimmune diseases15,16,17, as well as cancer18.

Currently, there are two main high-throughput approaches to sequencing the BCR repertoire, differing by scale and resolution: bulk (bulkBCR-seq) and single-cell (scBCR-seq) sequencing. BulkBCR-seq so far provides the highest sampling depth for the purpose of covering the diversity of the immune repertoire, with a wide array of established methods19,20,21,22. In contrast, most currently available scBCR-seq methods have 100–1000 times lower sampling depth than bulkBCR-seq23, although novel methods are being developed to improve throughput. However, scBCR-seq can recover the native pairing between the heavy chain and the light chain. While bulkBCR-seq libraries can extract BCR information from 10⁵ to 10⁹ cells, scBCR-seq libraries currently limit input to 10³–10⁵ cells due to technology constraints24. This throughput gap is biologically important because lower repertoire coverage results in less information captured. Specifically, since the functions of the Ig repertoire are derived from its diversity3,25,26,27,28,29, capturing immune repertoire diversity is fundamental to uncovering its functions. The higher sampling depth and relative simplicity of library preparation make bulkBCR-seq suitable for abundant and easily accessible B-cell samples such as those isolated from peripheral blood. Conversely, scBCR-seq is better suited to characterizing B-cell subsets that are difficult to obtain and limited in number, including cells isolated from lymphatic tissues. So far, only a few studies have employed both approaches to examine the BCR repertoire at different scales30,31, although their joint analysis is necessary for a comprehensive understanding of the genomic compartment of the adaptive immune system.

Despite the potential of BCR-seq to profile BCR diversity, it cannot be applied to characterize secreted antibodies, since these molecules are proteins and cannot be directly examined on the nucleotide level. Not all BCRs that are expressed will become antibodies32, and the correlation between the abundance of BCRs and their antibody counterparts in the serum is unclear33,34. Therefore, it is necessary to determine antibody sequences directly on the proteomic level. As of now, there are no protein sequencing methods that can read the sequence directly at high throughput. To determine the sequence of antibodies, proteomics techniques such as liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) have been utilized34,35,36,37. Briefly, antibodies purified by affinity chromatography are digested with proteases into short peptides, fractionated by liquid chromatography, and analyzed by mass spectrometry. The recorded mass spectra are then matched against reference in silico spectra, created from genomic sequencing data, in order to determine the peptide sequence; this approach is hereafter referred to as Ab-seq38. Since antibodies are highly similar to each other in overall sequence yet highly diverse as a population, and the proportion of shared clones between individuals is very low28, it is more sensible to create a custom reference database from the same individual to ensure higher accuracy34,36,37,39,40. Novel methods have been developed where the peptide sequence can be determined de novo (without reference sequences), but these methods are not yet well-established for antibodies41,42,43,44,45,46,47,48. Therefore, the integration of Ab-seq with BCR sequencing technologies holds the promise of connecting the genomic and proteomic levels of the adaptive immune repertoire.

To address the scarcity of datasets where both bulkBCR-seq and scBCR-seq data are available, we performed BCR sequencing on a healthy adult donor at depth (bulkBCR-seq) and with chain pairing (scBCR-seq), with scBCR-seq sample replicates to examine the effect of sampling depth on commonly used repertoire features in benchmarking studies8,13. This was combined with computational pre-processing to account for differences in bulkBCR-seq and scBCR-seq data. Furthermore, to compare repertoire features across individuals, we also performed bulkBCR-seq and scBCR-seq on ten healthy adult donors. In addition to the data we generated, the processing pipeline was also applied to BCR sequencing data from a public dataset: six pediatric patients from a study conducted by King and colleagues30. Combining insights gained from the analysis of both own-generated and public data, we found that VH-gene frequencies are consistent within individuals across sequencing methods, while clonal sequence overlap is significantly affected by changes in sampling depth. Furthermore, with the recovery of clonal sequences from mass spectrometry peptides in Ab-seq, we demonstrated the potential of utilizing both bulkBCR-seq and scBCR-seq data as reference to examine the diversity of the serum antibody repertoire. Collectively, our results serve to demonstrate as proof-of-principle the feasibility of integrating complementary methods to interrogate the humoral immune system at multiple levels.

Results

Experimental design and dataset description

To determine the ability of BCR-seq and Ab-seq to capture information from the humoral immune system, we selected a number of repertoire features, which have been shown to encompass diverse immunological dimensions of the Ig repertoire14,49. These include (1) VH-gene usage (the frequency of all heavy chain V genes in a sample) (Fig. 1c), (2) shared CDRH3 amino acid sequences between samples (quantified by Jaccard similarity index) (Fig. 1d), (3) clonotype identification and V(D)J sequence reconstruction from Ab-seq peptides (based on BCR-seq reference libraries) (Fig. 1e).

Fig. 1: Experimental design for the comprehensive examination of the Ig repertoire at genomics and proteomics levels.

a The sequence information of the Ig repertoire can be examined on the nucleotide level by bulk BCR sequencing (bulkBCR-seq) and single-cell BCR sequencing (scBCR-seq) or on the amino acid level by bottom-up antibody mass spectrometry (Ab-seq). However, there is a lack of joint analysis integrating and comparing the overlap between methods, and few studies where Ab-seq data augment BCR data. b To address this problem, we developed a workflow that extracts repertoire information by combining high-throughput bulkBCR-seq with natively paired receptor information in scBCR-seq, then leveraging this sequence information to examine the composition of the serum antibody repertoire using Ab-seq. Our findings indicate that: (c) VH-gene usage is conserved within individuals despite differences in sequencing methods, (d) the gap in sampling depth between bulkBCR-seq and scBCR-seq resulted in low clonal sequence overlap, contributing to lower biological coverage of the Ig repertoire, and (e) it is possible to recover clonal sequences from peptide sequences characterized by Ab-seq, with paired chain BCR sequencing contributing to paired chain V(D)J sequence reconstruction. This figure was generated in BioRender.

We utilized a total of three datasets in this study. In Dataset 1, we performed bulkBCR-seq and scBCR-seq of total B cells from a healthy adult donor’s peripheral blood (Fig. 2a). This allowed us to examine the effect of the sampling depth gap between bulkBCR-seq and scBCR-seq on capturing repertoire features, and the influence of technical replicates. In addition, personalized BCR references with high depth and paired chains allowed for recovery of clonal identity from Ab-seq peptides. Dataset 1 contained 36 samples of four isotypes: IgA, IgD, IgG, and IgM. The bulkBCR-seq samples had 20,942–195,417 unique CDRH3 aa sequences, while scBCR-seq samples had 45–5885 unique CDRH3 aa sequences (Supplementary Fig. 1a), with Sample 8 having the lowest number of CDRH3 sequences across all isotypes owing to suboptimal efficiency in single-cell encapsulation. Clonal expansion, measured by repertoire Evenness (see Methods), was higher in bulkBCR-seq samples compared to scBCR-seq samples, and higher in IgG and IgA samples compared to IgD and IgM samples (Supplementary Fig. 2a). The number of unique VH genes was 39–42 in bulkBCR-seq samples and 54–63 in scBCR-seq samples (Supplementary Fig. 3a). In addition to bulkBCR-seq and scBCR-seq, we performed Ab-seq by isolating antibodies of various isotypes (IgG, IgA, all κ chain antibodies, and IgM) from the serum, digesting them into peptides with proteases (Trypsin, Chymotrypsin, Chymotrypsin + Trypsin, and AspN), and analyzing the peptides with LC-MS/MS (Fig. 2a).

Fig. 2: Overview of the three datasets analyzed.

To analyze and compare repertoire features (e.g., VH-gene usage, CDRH3 sharing) across samples generated using bulkBCR-seq, scBCR-seq, and Ab-seq, we utilized both own-generated (Dataset 1 and Dataset 2) and public data (Dataset 3). a In Dataset 1 (blue rectangle), total B cells were isolated from the peripheral blood of one healthy adult, then sequenced in bulk or encapsulated into single-cell droplets in 8 technical replicates before sequencing. In addition, antibodies were isolated from the serum of the same individual, digested with protease, and then the peptides were analyzed by LC-MS/MS. b In Dataset 2 (green rectangle), peripheral blood mononuclear cells (PBMCs) were isolated from the peripheral blood of ten healthy adults, and B cells were then isolated from the PBMCs. The isolated B cells were either sequenced in bulk or encapsulated into single-cell droplets and then sequenced. c In Dataset 3 (red rectangle) (publicly available data from a study by King and colleagues30), tonsillar samples were obtained from six pediatric patients, and B cells were stained and sorted by flow cytometry into cell subsets (Naive: naive B cells, GC: germinal center B cells, Bmem: memory B cells, PB: plasmablast) before bulk sequencing or single-cell encapsulation followed by single-cell sequencing. See also Supplementary Fig. 1 for unique CDRH3 sequence count, Supplementary Fig. 2 for repertoire Evenness, and Supplementary Fig. 3 for VH-gene usage distribution. This figure was generated in BioRender.

In Dataset 2, we performed bulkBCR-seq and scBCR-seq of total B cells from ten healthy adult donors (Fig. 2b). This dataset had uniform library preparation across multiple donors, allowing for comparison of the same repertoire feature across individuals. Dataset 2 contained 80 samples from ten donors of different isotypes (IgA, IgD, IgG, and IgM), with a unique CDRH3 aa sequence count range of 2899–223,590 for bulkBCR-seq samples and 85–9360 for scBCR-seq samples (Supplementary Fig. 1b). Repertoire Evenness, similar to Dataset 1, was higher in bulkBCR-seq samples than in scBCR-seq samples, and IgD samples were less expanded than other isotypes (Supplementary Fig. 2b). BulkBCR-seq samples had 54–65 unique VH genes, while scBCR-seq samples had 46–59 unique VH genes (Supplementary Fig. 3b).

For Dataset 3, we utilized published data from King and colleagues30, containing BCR-seq data from six pediatric patients (Fig. 2c). By utilizing flow cytometry-sorted B cells as starting material for sequencing libraries, Dataset 3 allowed the comparison of repertoire features across B-cell subsets. Dataset 3 contained 30 bulkBCR-seq samples with BCR sequences of different B-cell subsets (unsorted, naive, germinal center, memory, and plasmablast) and 12 scBCR-seq samples from unsorted and memory B cells. Unique CDRH3 aa count was 4601–55,522 for bulkBCR-seq samples and 76–3424 for scBCR-seq samples (Supplementary Fig. 1c). Clonal expansion was similar across B-cell subsets in bulkBCR-seq samples except naive B cells, and in scBCR-seq samples, memory B cells were less expanded compared to unsorted B cells (Supplementary Fig. 2c). Unique VH gene count was 53–56 genes in bulkBCR-seq samples, and 47–51 genes in scBCR-seq samples (Supplementary Fig. 3c).

An individual’s VH-gene usage is captured by both bulkBCR-seq and scBCR-seq

The VH-gene usage profile, characterized by the count of germline VH genes and their corresponding frequencies within a sequenced repertoire, provides a basic overview of the immune repertoire diversity50,51,52,53. Therefore, we explored whether scBCR-seq, despite its reduced sampling depth, could adequately capture an individual’s VH-gene composition within their immune repertoire (Fig. 1a). Grouping sample pairs by donor and sequencing method, Dataset 2 (Fig. 2b) revealed a significantly higher consistency between samples from the same donor, whether the sequencing method was the same (n = 120, median Pearson correlation r = 0.90) or different (n = 160, r = 0.84), compared to samples from different donors of the same sequencing method (n = 1440, r = 0.78) and different sequencing methods (n = 1440, r = 0.75) (p < 0.05, Fig. 3a). Similarly, in Dataset 3 (Fig. 2c), VH-gene usage was more correlated between samples from the same donor and the same method (n = 66, r = 0.95) or different methods (n = 60, r = 0.87) compared to samples from different donors of the same method (n = 435, r = 0.81) or different methods (n = 300, r = 0.75) (p < 0.05, Fig. 3a). Hierarchical clustering of all VH-gene usage pairwise comparisons also showed clusters of higher similarity within individuals (Supplementary Fig. 5b, c).
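The group sizes quoted for Dataset 2 follow from the sample layout alone. As a sanity check, the sketch below enumerates every sample pair for a Dataset 2-like layout (ten donors, four isotypes, two sequencing methods) and tallies the pairs by donor and method. The tuple representation and the function name are illustrative assumptions, not the study's code.

```python
from collections import Counter
from itertools import combinations, product

def count_pair_groups(n_donors, isotypes, methods):
    """Tally unordered sample pairs by (same donor?, same method?)."""
    # One sample per (donor, isotype, method) combination.
    samples = list(product(range(n_donors), isotypes, methods))
    counts = Counter()
    for (d1, _, m1), (d2, _, m2) in combinations(samples, 2):
        donor = "same donor" if d1 == d2 else "different donors"
        method = "same method" if m1 == m2 else "different methods"
        counts[(donor, method)] += 1
    return counts

groups = count_pair_groups(10, ["IgA", "IgD", "IgG", "IgM"], ["bulk", "sc"])
for key, n in sorted(groups.items()):
    print(key, n)
```

Under these assumptions the tallies reproduce the n = 120, 160, 1440, and 1440 reported for the four groups.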

Fig. 3: Both bulkBCR-seq and scBCR-seq capture an individual’s VH-gene usage profile.

VH-gene usage profiles were constructed by counting the frequency of all VH genes in a sample without weighting by clonotype size. VH-gene usage similarity between samples was measured by Pearson correlation. a VH-gene usage Pearson correlation between samples by donor (same or different donors) and sequencing method (bulkBCR-seq or scBCR-seq). b VH-gene usage Pearson correlation between bulkBCR-seq samples by donor and B-cell subset (same or different B-cell subsets). c VH-gene usage Pearson correlation between bulkBCR-seq and scBCR-seq samples (cumulatively merged from one to eight technical replicates to increase sampling depth). Numbers displayed below each violin plot signify the median Pearson correlation value. Global differences between the Pearson correlation values were determined using the Kruskal-Wallis test, and pairwise differences were determined by the Wilcoxon Rank Sum test, with p values adjusted for multiple testing by Bonferroni correction. All adjusted p values lower than 0.05 are displayed above brackets. See also Supplementary Fig. 5 for VH-gene usage correlation values displayed as heatmaps with hierarchical clustering.
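The construction described in this legend can be sketched in a few lines. The example below is a minimal illustration, assuming each sample is given as a list of VH-gene calls with one entry per unique sequence (so no weighting by clonotype size), and that a gene absent from a sample contributes a frequency of zero. The function names and toy gene calls are hypothetical.

```python
from collections import Counter
from math import sqrt

def vh_usage(v_calls, genes):
    """Frequency of each VH gene among the unique sequences of one sample."""
    if not v_calls:
        raise ValueError("sample has no VH-gene calls")
    counts = Counter(v_calls)
    return [counts[g] / len(v_calls) for g in genes]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sx * sy)

# Toy samples (made-up calls); profiles share one gene order.
sample_a = ["IGHV1-69", "IGHV3-23", "IGHV3-23", "IGHV4-34"]
sample_b = ["IGHV1-69", "IGHV3-23", "IGHV3-23", "IGHV3-23", "IGHV4-34", "IGHV4-34"]
genes = sorted(set(sample_a) | set(sample_b))
r = pearson(vh_usage(sample_a, genes), vh_usage(sample_b, genes))
print(round(r, 3))
```

Using the union of genes as a shared axis keeps the two frequency vectors aligned; whether the study restricted the axis to a fixed germline gene list is not stated here.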

In Dataset 3, where B cells were sorted into different subsets, we asked whether VH-gene usage differed between subsets. BulkBCR-seq samples with the same B-cell subset but different donors (n = 75, r = 0.84) were more similar in VH-gene usage compared to samples of different donors and different subsets (n = 300, r = 0.81) (p < 0.05). However, the difference between those groups was small compared to the group of samples of the same donor but different subsets (n = 60, r = 0.95), suggesting that the individual’s impact on the germline VH-gene usage profile is larger (Fig. 3b). This was also reflected in the clustering of samples, where the main clusters consisted primarily of samples from the same individual (Supplementary Fig. 5c).

Delving further into the differences between sequencing methods, in Dataset 1, we compared the VH-gene usage distribution between bulkBCR-seq and scBCR-seq samples within the same individual (Fig. 2a). Similarity in VH-gene usage was the highest between bulkBCR-seq samples (n = 6, r = 0.95, blue dashed line), followed by scBCR-seq samples (n = 496, r = 0.87, red dashed line). Across methods, however, median Pearson correlation remained consistent (n = 16–128, r = 0.74–0.75) even as the scBCR-seq technical replicates were cumulatively merged to reduce the gap in sampling depth between methods. Hierarchical clustering also showed that scBCR-seq replicates differed mainly by BCR isotype, and that the bulkBCR-seq sample clusters were largely separated from the scBCR-seq samples (Supplementary Fig. 5a). To summarize, in spite of differences between methods, both bulkBCR-seq and scBCR-seq are sufficient to characterize an individual’s distinct VH-gene usage profile, and the individual is the largest determinant of VH-gene usage, which is in line with previously established findings in the field54.

BulkBCR-seq provides better coverage of BCR repertoire biological diversity as opposed to scBCR-seq

Given the aforementioned gap in sampling depth between bulkBCR-seq and scBCR-seq, we investigated whether the coverage of the BCR repertoire diversity was affected by scBCR-seq’s lower B-cell input (Fig. 1a). Utilizing pairwise CDRH3 Jaccard overlap as a measurement for repertoire coverage (see Methods), between bulkBCR-seq and scBCR-seq samples in Dataset 1, pairs of samples of the same isotype (n = 124, median Jaccard overlap index J = 0.00395) had an approximately 13-fold higher overlap than samples of different isotypes (n = 372, J = 0.00031) (p < 0.05, Fig. 4a). Specifically, for samples of the same isotype, CDRH3 overlap increased significantly when technical replicates of scBCR-seq samples were cumulatively merged at 3 (n = 20, J = 0.0045), 4 (n = 20, J = 0.0052), and 5 (n = 16, J = 0.0062) replicates (p < 0.05), after which the increase in overlap was no longer statistically significant compared to the original scBCR-seq samples (Fig. 4a).
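To make the overlap measurement concrete, the sketch below computes the Jaccard index between sets of CDRH3 amino acid sequences and tracks how it changes as scBCR-seq replicates are merged one after another. It assumes the standard definition J(A, B) = |A ∩ B| / |A ∪ B| and a single fixed merge order; the exact definition and replicate combinations used in the study are given in its Methods. All sequences are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of sequences."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cumulative_merge_overlap(bulk, replicates):
    """Jaccard overlap with `bulk` after merging replicates 1..k, for each k."""
    merged = set()
    overlaps = []
    for rep in replicates:
        merged |= set(rep)  # pool unique CDRH3s across merged replicates
        overlaps.append(jaccard(bulk, merged))
    return overlaps

# Toy data: one bulk sample and three single-cell replicates.
bulk = {"CARDYW", "CARGGW", "CAKDLW", "CTRSSW", "CARPPW", "CVRNNW"}
replicates = [{"CARDYW", "CAAAAW"}, {"CARGGW"}, {"CAKDLW", "CTRSSW"}]
overlaps = cumulative_merge_overlap(bulk, replicates)
print([round(j, 3) for j in overlaps])
```

In this toy case every merged replicate adds sequences already present in the bulk sample, so the overlap rises with each merge; in real data, sequences unique to the single-cell replicates also enlarge the union and can dampen the gain.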

Fig. 4: CDRH3 sequence overlap between bulkBCR-seq and scBCR-seq samples is higher within the same isotype, increased with higher sampling depth, and within the same individual.

Pairwise CDRH3 amino acid sequence overlap was quantified using Jaccard overlap index (see Methods), with grouping of the Jaccard overlap index from the samples by isotypes, number of technical replicates merged, donors, and B-cell subsets. a CDRH3 sequences overlap of bulkBCR-seq versus scBCR-seq samples with the same or different isotypes; CDRH3 sequences overlap of bulkBCR-seq versus scBCR-seq samples of the same isotype, with scBCR-seq technical replicates cumulatively merged from one to eight replicates. b CDRH3 sequence overlap between samples from different donors or the same donor. c CDRH3 sequence overlap between samples of different B-cell subsets: unsorted B cells (Unsorted), naive B cells (Naive), germinal center B cells (GC), memory B cells (Bmem), plasmablasts (PB). Numbers displayed below each violin plot show the median Jaccard overlap value. Log10 scales were utilized on the y-axis when appropriate to enhance visual clarity. Global differences between the Jaccard overlap values were determined by the Kruskal-Wallis test and pairwise differences were determined by the Wilcoxon Rank Sum test, with p values adjusted for multiple testing by Bonferroni correction. All adjusted p values lower than 0.05 are displayed above brackets. See also Supplementary Fig. 6 for CDRH3 Jaccard overlap values displayed as heatmaps with hierarchical clustering.

Between samples grouped by donor, samples from different donors had a low proportion of shared CDRH3 sequences, as reflected by low median Jaccard overlap in Dataset 2 (n = 2880, J = 0) and Dataset 3 (n = 735, J = 7 × 10⁻⁵), while samples from the same donor had significantly more shared CDRH3 sequences in Dataset 2 (n = 280, J = 2 × 10⁻⁴) and Dataset 3 (n = 126, J = 0.00193) (p < 0.05, Fig. 4b). This pattern was confirmed using Jaccard similarity index-based clustering analysis (Supplementary Fig. 6b,c), which revealed that clusters of higher clonal overlap were from the same individual.

In Dataset 3, comparing the CDRH3 Jaccard overlap of bulkBCR-seq samples from the same donor and different B-cell subsets (n = 6 in each group) showed that naive B cells had the lowest amount of shared clonal sequences with other subsets (J = 0.0008–0.0021), a 3–12-fold difference compared to the median overlap between memory B cells and germinal center B cells, plasmablast, and unsorted B cells (range: 0.0066–0.0099), and a 15–52-fold difference compared to the median overlap between unsorted B cells, germinal center B cells, and plasmablast (range: 0.0309–0.042) (Fig. 4c). However, since the number of samples in each group was small (n = 6) compared to the number of pairwise comparisons (n = 45), the differences were not statistically significant after application of multiple testing correction.
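A short calculation illustrates why six samples per group cannot survive this correction. The sketch below assumes an exact two-sided rank-sum test without ties and a Bonferroni family of 45 comparisons; under those assumptions, even two perfectly separated groups of six yield an adjusted p value above 0.05. The function names are illustrative.

```python
from math import comb

def min_two_sided_exact_p(n1, n2):
    """Smallest two-sided p value an exact rank-sum test can return (no ties).

    Perfect separation is one of comb(n1 + n2, n1) equally likely rank
    assignments under the null; the two-sided test doubles that probability.
    """
    return min(1.0, 2 / comb(n1 + n2, n1))

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)

p_min = min_two_sided_exact_p(6, 6)
p_adj = bonferroni(p_min, 45)
print(f"{p_min:.5f} {p_adj:.4f}")
```

With these inputs the smallest attainable raw p value is 2/924 (about 0.0022), which becomes roughly 0.097 after multiplying by 45.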

The phenomenon of “Light chain coherence (LCC)” was recently described by Jaffe and colleagues55. Briefly, it quantifies the probability of unrelated B cells with similar heavy chains also having similar light chains under certain criteria (see Methods). Here, we investigated whether we could observe LCC in our scBCR-seq datasets. In Dataset 1, the median LCC within an individual was similar when technical replicates were cumulatively merged (range: 88.31–93.06%, number of cells evaluated for LCC: 118–4299, number of cells in repertoires: 782–39,154) (Supplementary Fig. 7a). For Dataset 2, LCC within individuals was also high (range: 95.78–100%, number of cells evaluated for LCC: 64–2121, number of cells in repertoires: 3416–11,156), except for a single donor where there were no cell pairs meeting the criteria for evaluation (Supplementary Fig. 7b). In Dataset 3, regarding B-cell subsets, four out of six unsorted B cell samples did not have any cell pairs with the same VH gene, same CDRH3 sequence, and different clonotypes for LCC calculation, as opposed to memory B cells, where all six samples had LCC ranging from 80.3 to 100% (number of cells evaluated for LCC: 14–577, number of cells in repertoires: 189–3526) (Supplementary Fig. 7c). Across donors, however, LCC was found in only 2 out of 45 pairwise donor combinations in Dataset 2 and only 1 out of 30 pairwise donor and subsets combinations in Dataset 3 (Supplementary Fig. 7d, e). Our findings are in agreement with Jaffe and colleagues, where LCC was higher in memory B cells than in naive cells, but LCC cells in absolute numbers are low compared to the size of the repertoires55.

To summarize, the higher sampling depth provided by bulkBCR-seq allowed for better coverage of BCR sequence diversity, particularly in B-cell subsets that are highly diverse per se, such as naive B cells. For scBCR-seq, merging technical replicates can help compensate for the low coverage of the BCR repertoire and low clonal sequence overlap with bulkBCR-seq.

BCR-seq augmentation allows mapping of Ab-seq peptides to the CDR3

With the sequencing data generated in Dataset 1, we investigated if Ab-seq can be utilized to identify specific clonotypes from serum antibodies and potentially recover the full V(D)J sequences from short peptides (Fig. 1a). A clonotype is identified when an Ab-seq peptide maps uniquely to the BCR-seq reference sequence of that clonotype and overlaps at least 3 contiguous aa of its CDRH3 sequence. The bulkBCR-seq and scBCR-seq data were used to create reference databases for Ab-seq peptide alignment with 310,495 heavy chain and 113,272 light chain bulkBCR-seq V(D)J sequences, in addition to 47,756 heavy chain and 31,309 light chain scBCR-seq V(D)J sequences. We isolated serum antibodies of different isotypes: IgA, IgG, IgK (all κ chain antibodies), and IgM (Fig. 5a) and sequenced the antibodies using Ab-seq (see Methods). In total, 18,311 peptides were recovered from serum antibodies after contaminant removal. Of those peptides, 10,463 (57.1%) were aligned to the variable region of reference BCR sequences, and 887 (4.8%) peptides overlapped at least three amino acids with the CDR3 of a reference BCR sequence (CDR3-overlapping) (Fig. 5b). Among the CDR3-overlapping peptides, the distribution of peptide length mostly followed the distribution of CDR3 overlap length, with the majority of peptides ranging from 7 to 20 aa in length and overlapping 3–12 residues with the reference’s CDR3. Only a few peptides were longer than 20 aa and overlapped fewer than 6 aa with the reference’s CDR3 (Fig. 5c).
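The identification rule can be illustrated with a small sketch. It assumes peptides are placed on reference sequences by exact substring matching, which is a simplification of whatever alignment the actual pipeline performs (see Methods); the reference sequences, clonotype names, and function names below are made up for illustration.

```python
def cdr3_overlap(peptide, vdj, cdr3):
    """Largest number of CDR3 residues covered by an exact placement of peptide."""
    c_start = vdj.find(cdr3)
    if c_start < 0:
        raise ValueError("CDR3 not found in V(D)J sequence")
    if not peptide:
        return 0
    c_end = c_start + len(cdr3)
    best = 0
    pos = vdj.find(peptide)
    while pos >= 0:
        # Length of the intersection of [pos, pos+len) with [c_start, c_end).
        overlap = min(pos + len(peptide), c_end) - max(pos, c_start)
        best = max(best, overlap)
        pos = vdj.find(peptide, pos + 1)
    return best

def classify_peptide(peptide, references, min_overlap=3):
    """Label a peptide by how many reference clonotypes it overlaps in the CDR3."""
    hits = [cid for cid, (vdj, cdr3) in references.items()
            if cdr3_overlap(peptide, vdj, cdr3) >= min_overlap]
    if not hits:
        return "no CDR3 overlap", hits
    return ("unique" if len(hits) == 1 else "multiple"), hits

# Toy reference: clonotype id -> (V(D)J aa sequence, CDR3 aa sequence).
references = {
    "clone_A": ("EVQLVESGGAVYYCARDGYSSGWYFDYWGQGTLVTVSS", "CARDGYSSGWYFDYW"),
    "clone_B": ("EVQLVESGGAVYYCAKDLGWELLPFDYWGQGTLVTVSS", "CAKDLGWELLPFDYW"),
}
for pep in ["AVYYCARDGYSSGW", "FDYWGQGTLVTVSS", "EVQLVESGG"]:
    print(pep, classify_peptide(pep, references))
```

The second toy peptide shows why many CDR3-overlapping peptides are ambiguous: it covers only the conserved end of the CDR3 and the framework that follows, which both reference clonotypes share.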

Fig. 5: Only a small proportion of Ab-seq peptides overlap with the CDR3 region, and most BCR reference matches were from less expanded clonotypes in BCR-seq.

Peptides from digestion of serum Abs were analyzed by LC-MS/MS, and their sequences were aligned to references made from bulkBCR-seq and scBCR-seq data of the same individual. a Sample setup for Ab-seq. b Number of peptides identified by LC-MS/MS (All), aligned to reference BCR sequences (Ab-specific), and overlapping at least 3 aa with the reference sequence’s CDR3 (CDR3-overlapping). c Ab-seq peptide length in relation to the length of overlap with its reference’s CDRH/L3. d Number of CDR3-overlapping peptides by the sequencing method of the reference match: bulkBCR-seq, scBCR-seq, or both bulkBCR-seq and scBCR-seq (both). e Number of CDR3-overlapping peptides by protease treatment: AspN, Chymotrypsin (Ct), Chymotrypsin followed by Trypsin (Ct+Tryp), and Trypsin (Tryp). f Number of CDR3-overlapping peptides that mapped to only one reference clonotype or multiple clonotypes. g Number of uniquely mapped CDR3-overlapping peptides by clonal size ranking in descending order of the BCR-seq reference match in log10 scale.

Separating the CDR3-overlapping peptides by the sequencing method of the reference matches, we saw that bulkBCR-seq contributed the majority of peptide matches (406 for heavy chain and 144 for light chain), while a smaller number of peptides matched to scBCR-seq references (35 for heavy chain and 14 for light chain). A substantial number of CDR3-overlapping peptides mapped to both bulkBCR-seq and scBCR-seq references (203 for heavy chain and 85 for light chain) (Fig. 5d). Grouping the CDR3-overlapping peptides by the proteases used in antibody digestion, Trypsin provided the highest number of peptides (322 for heavy chain and 99 for light chain), followed by AspN (260 for heavy chain and 83 for light chain), Chymotrypsin (33 for heavy chain and 42 for light chain), and Chymotrypsin + Trypsin (29 for heavy chain and 19 for light chain) (Fig. 5e). Out of 887 CDR3-overlapping peptides, only 309 (34.8%) peptides mapped uniquely to only one clonotype in the reference, while 578 peptides (65.2%) mapped to multiple reference clonotypes (Fig. 5f). From these uniquely mapped peptides, we compared the isotype of the BCR-seq reference with the isotype of the antibody input and found that the majority of IgD, IgM, and IgA clonotypes in the BCR-seq reference class-switched into IgG in the serum, with a smaller proportion of clonotypes switching to IgA (Supplementary Fig. 8a).

Examining the rank distribution of the BCR reference clonotypes, i.e., a measure of abundance in the BCR repertoire, we found that almost all uniquely mapped CDR3-overlapping peptides aligned to clonotypes that were less expanded, with their clonal rank being higher than 1000 for bulkBCR-seq references. A similar trend was observed for scBCR-seq (Fig. 5g). In addition, the identified clonotypes were highly diverse, as indicated by the Levenshtein (edit) distance between their CDR3 references, with the majority of sequence distances ranging from 6 to 20 for heavy chain clonotypes and from 3 to 11 for light chain clonotypes (Supplementary Fig. 8b).
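For readers unfamiliar with the metric, the Levenshtein distance counts the minimum number of single-residue insertions, deletions, and substitutions needed to turn one sequence into another. A minimal dynamic-programming sketch follows; the CDR3 strings are made-up examples and the helper name `pairwise_distances` is hypothetical.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def pairwise_distances(sequences):
    """Edit distance for every unordered pair of sequences."""
    return {(s, t): levenshtein(s, t) for s, t in combinations(sequences, 2)}

cdr3s = ["CARDGYW", "CAKDGYW", "CARDGYSSGWYFDYW"]
for pair, dist in pairwise_distances(cdr3s).items():
    print(pair, dist)
```

A distance of 1 between two CDR3s means they differ by a single residue, whereas the distances of 6 to 20 reported for heavy chains indicate largely unrelated sequences.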

From the uniquely mapped CDR3-overlapping peptides retrieved by mapping Ab-seq peptides to BCR-seq references, sequence information from the reference clonotype could be retrieved, including the V gene and J gene name, the CDR3 sequence, and the full V(D)J sequence of that clonotype. For bulkBCR-seq reference matches (212 peptides on the heavy chain and 65 peptides on the light chain), V(D)J sequence recovery is possible for the chain that the Ab-seq peptide mapped to (Fig. 6a, Supplementary File 1). For scBCR-seq matches (24 peptides on the heavy chain and 8 peptides on the light chain), a uniquely mapped Ab-seq peptide identifies not only the corresponding V(D)J sequence on one chain, but also the V(D)J sequence of the paired chain of the same B cell. This is possible because of the cell barcode added to each single-cell droplet during scBCR-seq library preparation, together with the high LCC measured in Dataset 1 (88.31–93.06%, Supplementary Fig. 7), which accounts for the rare instances where different cells have a similar heavy chain but different light chains, and vice versa. This in turn allowed for proteomic paired chain sequence recovery for 32 antibodies (Fig. 6b, Supplementary File 1).
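The barcode-based lookup behind paired chain recovery can be sketched as follows. This is an illustration only: the record fields (`contig_id`, `barcode`, `chain`, `vdj_aa`), the barcodes, and the sequences are hypothetical and do not reproduce any particular pipeline's output format.

```python
# Which chain types can pair with the chain a peptide mapped to.
PARTNER_CHAINS = {"IGH": {"IGK", "IGL"}, "IGK": {"IGH"}, "IGL": {"IGH"}}

def recover_paired_chain(hit_contig_id, contigs):
    """Return the matched contig and partner-chain contigs from the same cell."""
    by_id = {c["contig_id"]: c for c in contigs}
    hit = by_id[hit_contig_id]
    partners = [c for c in contigs
                if c["barcode"] == hit["barcode"]               # same droplet/cell
                and c["chain"] in PARTNER_CHAINS[hit["chain"]]]  # opposite chain
    return hit, partners

# Toy single-cell reference with made-up barcodes and sequences.
contigs = [
    {"contig_id": "c1", "barcode": "AAAC-1", "chain": "IGH", "vdj_aa": "EVQLVES"},
    {"contig_id": "c2", "barcode": "AAAC-1", "chain": "IGK", "vdj_aa": "DIQMTQS"},
    {"contig_id": "c3", "barcode": "GGGT-1", "chain": "IGH", "vdj_aa": "QVQLQES"},
    {"contig_id": "c4", "barcode": "GGGT-1", "chain": "IGL", "vdj_aa": "QSVLTQP"},
]
hit, partners = recover_paired_chain("c1", contigs)
print(hit["chain"], [p["contig_id"] for p in partners])
```

Returning a list rather than a single partner leaves room for cells with more than one contig of the partner chain type, which would need to be resolved separately.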

Fig. 6: Ab-seq peptides that map to a specific clonotype can be utilized to recover clonal information.

Uniquely mapped CDR3-overlapping peptides (quantified in Fig. 5f) were utilized to recover the clonotype information provided by BCR-seq, with overlapping segment in bold. a Example of Ab-seq peptides mapped to a bulkBCR-seq used to recover the single chain V(D)J sequence. b Example of Ab-seq peptides mapped to a scBCR-seq used to recover the paired chain V(D)J sequence by utilizing the cell barcode unique to each single-cell droplet. See Supplementary File 1 for the full table of all V(D)J sequences recovered by Ab-seq.

In summary, recovery of V(D)J sequences directly from serum antibodies is feasible with Ab-seq augmented by BCR-seq references, although only a small fraction of mass-spectrometry peptides could be utilized for clonotype identification in Ab-seq, and the majority of Ab-seq peptides mapped to less expanded clonotypes in the BCR repertoire. With scBCR-seq reference libraries in particular, paired chain V(D)J sequence reconstruction is also possible with reasonable confidence.

Discussion

Summary of main findings

In this study, we conducted a comprehensive examination of human BCR repertoire features from bulkBCR-seq and scBCR-seq data, taking into account factors such as sampling depth, the use of replicates, different donors, and different B-cell subsets. The results led us to conclude that both bulkBCR-seq and scBCR-seq are suitable to characterize an individual’s VH-gene usage profile, even though the sequencing methods contributed to variations in the results (Fig. 3, Supplementary Fig. 5). However, with respect to measuring the overlap between repertoires, the lower sampling depths of scBCR-seq resulted in lower CDRH3 sequence overlap compared to bulkBCR-seq (Fig. 4, Supplementary Fig. 6). Utilizing personalized BCR-seq libraries as references for Ab-seq, we demonstrated the feasibility of reconstructing the full V(D)J sequences from short peptides of circulating antibodies in the serum (Figs. 5, 6 and Supplementary Fig. 8), establishing a connection between the genomic and proteomic diversity of the Ig repertoire.

Clonal abundance and its effects on repertoire features

For the purpose of comparing bulkBCR-seq and scBCR-seq samples in this study, we selected metrics that did not take clonal abundance into account to characterize repertoire features. This is because some repertoire feature metrics are weighted by clonal proportions, and thus are affected by different clonal abundances. For bulkBCR-seq, clonal abundance is measured through the use of UMIs, which count the number of RNA molecules present in the sample56,57,58. Thus, clonal count based on UMIs can be distorted by the different levels of BCR expression for different B-cell subsets, particularly for plasmablasts59,60. On the other hand, scBCR-seq quantifies clonal abundance through cell barcodes, which better reflects the true cellular abundance of a biological clone61.

Impact of sequencing library preparation on profiling VH-gene usage

While there have been studies where both bulkBCR-seq and scBCR-seq were utilized to characterize the BCR repertoire for separate research purposes30,31, we processed the same B-cell samples with both sequencing methods using the same pre-processing pipeline, and quantified the effect of sequencing method on the capture of repertoire features (Figs. 3, 4). From the results, we speculate that the different chemistries utilized by bulkBCR-seq and scBCR-seq library preparation could affect the VH-gene usage profile of samples. For bulkBCR-seq libraries, 5’ multiplex PCR was used, based on a set of primers designed to capture all known human V genes. This allowed efficient full-length BCR sequencing while keeping the amplicon length under 600 bp for Illumina MiSeq21,62. However, one of the drawbacks of this method is that some genes might be preferentially amplified over others, leading to incorrect representation of the V gene distribution19,63,64. This is consistent with the lower number of unique VH genes identified in the bulkBCR-seq samples in Dataset 1 compared to their scBCR-seq counterparts (Supplementary Fig. 3a). On the other hand, the scBCR-seq libraries utilized 5’ template-switching PCR to amplify the V(D)J region, which relied on the addition of a template-switching handle for amplification, independent of the V-gene sequence. This, theoretically, allows for less biased capture of BCRs, but suffers from the low efficiency of template switching and produces longer amplicons21,65. In addition, 10x Genomics single-cell V(D)J profiling utilized restriction digestion to break the V(D)J amplicons into shorter pieces, and the contig sequence was reconstructed during pre-processing. This results in the need to impute the V(D)J region rather than characterizing it directly as in bulkBCR-seq libraries. As a result, we did not discriminate VH genes at the allele level because the scBCR-seq libraries were not full-length and required imputation from the reference database to obtain the full V(D)J sequences. Allele information may become crucial in the future as recent efforts in BCR repertoire profiling demonstrated the effect of germline gene variations on the ability to generate protective responses against pathogens66,67, leading to novel approaches in vaccine design, therapeutic antibody discovery, and disease diagnostics68.

Sampling depth in relation to the coverage of repertoire diversity

As previously mentioned, the diversity of the immune repertoire is fundamental to its functions. With the potential diversity of the BCRs in humans estimated to be \(>10^{13}\), there is a need to cover the largest extent of biological diversity possible. To this end, we characterized repertoire coverage by means of shared CDRH3 sequences between samples (Fig. 4, Supplementary Fig. 6). We found that there is very little overlap of CDRH3 sequences between individuals, in line with previous research27,28,69,70. One noticeable result, however, is that the proportion of shared CDRH3 sequences in scBCR-seq samples increased with higher sampling depth created by merging technical replicates together (Fig. 4a). These results suggest that scBCR-seq would require substantial improvements in throughput to describe the diversity of the BCR repertoire better. Furthermore, it may be advisable to perform benchmarking studies in smaller model organisms such as mice or rabbits33, where increased repertoire coverage is biologically feasible.

Isolation and sequence analysis of serum antibodies

The input of serum-derived antibodies in Ab-seq differed across isotypes for the following reasons: (1) antibody isotypes differ in their abundance in serum, as characterized by prior studies71; (2) the available reagents for antibody isolation from serum are not equivalent across isotypes: most commercial kits are optimized for IgG, particularly IgG1; for IgA, only IgA1 can be isolated; for IgM, there are no products for human antibody isolation, only products for mouse IgM, which have lower affinity to human IgM. All these factors contributed to the uneven levels of antibody inputs.

Despite the increased sample complexity of polyclonal serum antibodies, our Ab-seq platform identified a substantial number of clonotypes (Fig. 5). The peptides uniquely mapped to a reference clonotype made up only a minor fraction of all mass spectrometry peptides. Further improvements in antibody isolation methods could not only increase yield and purity, but also broaden the scope of antibody isotypes captured. In the future, novel methods to determine peptide sequences72 or intact proteins73 at single-molecule level may supplement or replace bottom-up mass spectrometry as the method of choice to analyze serum antibodies.

Our results showed that most Ab-seq peptides matched with lower-ranking clones in the BCR repertoire. These findings may be due to several reasons: (1) knowledge of correspondence between genomic BCR and proteomic antibody repertoire remains incomplete74,75, (2) previous studies also found a low correlation in clonal abundance between the genomic and proteomic Ig repertoire33,34,76, (3) we did not perform antigen-specific enrichment of antibodies prior to LC-MS/MS, unlike previous works on the same topic34,36,37,77,78, which renders identification of antibodies from reference libraries more challenging.

Paired chain BCR sequencing allowed for both the heavy chain and light chain V(D)J sequences to be recovered from a uniquely mapped Ab-seq peptide on either chain, as demonstrated in the results (Fig. 6). This information would prove valuable in applications such as antibody discovery and engineering. A similar approach has been followed in prior studies78,79,80, albeit only with heavy chain references to infer light chain references, not vice versa. Nevertheless, this approach is expected to be more commonly used as scBCR-seq methods improve in throughput in the future.

Final remarks and recommendations

Our study focused on integrating genomic and proteomic methods for investigating Ig repertoire features. In the long term, as technology matures and the throughput of scBCR-seq is improved, we might see that scBCR-seq eventually replaces bulkBCR-seq as the default method to examine BCR repertoires. This may be driven by the availability of novel single-cell chemistries that offer higher cell throughput81,82 than current droplet-based methods. However, in the near term, there is a place for both bulkBCR-seq and scBCR-seq. Thus, ensuring compatibility between sequencing technologies is important, since the gaps in agreement from different sequencing approaches may complicate the interpretation of sequencing results for the purpose of obtaining a comprehensive picture of the immune repertoire or novel drug development. Similarly, at present, mass spectrometry is the most mature method for protein sequence determination, but applications of mass spectrometry to antibody sequencing remain limited7. More work is needed to further refine this process, with the goal of identifying specific clonotypes of interest directly from blood in the background of polyclonal antibodies in the serum. There is also a need to understand how Ab-seq experimental and computational protocols impact the coverage of antibody diversity36,83,84. We expect antibody repertoire studies will be facilitated in the future by the advances in both machine learning-based46,48 and experiment-based72,73 de novo peptide sequence analysis efforts.

Establishing robust genomic and proteomic workflows for Ig repertoire profiling is essential both to understanding the mechanistic principles of humoral immunity and to antibody discovery and engineering efforts33,73,83,85,86,87. Thus, we coalesce our findings into several future-facing recommendations. Recommendation 1–Minimization of bias: the entire workflow, from library preparation and sequencing protocols to data processing, should be as similar as possible for bulkBCR-seq and scBCR-seq to minimize technology-based data variation. Future studies should also investigate the variability introduced by different computational preprocessing pipelines. Recommendation 2–Throughput: for BCR-seq samples, particularly for scBCR-seq, the highest possible throughput currently available should be utilized to capture the true diversity of the sample sufficiently. Recommendation 3–Future research: there is a need for more studies investigating the (antigen-specific88) inter- and intra-individual diversity of the antibody repertoire (e.g., quantification of how many serum antibody clonotypes are present at any one time, profiling antibodies at their generative sources, detecting changes in repertoire diversity as a function of antigenic challenge) and their correspondence with the BCR repertoire.

Methods

BCR repertoire library preparation

(Dataset 1) B-cell isolation

Peripheral blood from a healthy donor was collected in a BD Vacutainer® K2 EDTA tube, and B cells were isolated by negative selection directly from whole blood using MACSxpress® separator (Miltenyi Biotec) with Whole Blood B Cell Isolation Kit (Miltenyi Biotec). Sample acquisition was approved by the Regional Ethics Committee of South-Eastern Norway (project 6544, written informed consent was obtained from the participants of the study). The remaining erythrocytes were lysed with Red Blood Cell Lysis Solution (Miltenyi Biotec), and the remaining B cells were washed with PBS. B cell lysis and RNA extraction were performed using the RNeasy Kit (Qiagen), with the quality and concentration measured using Nanodrop (Thermo Fisher Scientific).

(Dataset 1) Bulk BCR sequencing library preparation

From isolated RNA, 200 ng was used to synthesize cDNA using 1 µl 100 µM isotype-specific reverse transcription primers (Supplementary Table 1), 1 µl 10 mM dNTP Mix (Thermo Fisher Scientific), and nuclease-free water to a final volume of 14.5 µl. The mixture was incubated for 5 min at 65 °C, then placed on ice immediately after. Then, 4 µl 5X RT buffer (Thermo Fisher Scientific), 0.5 µl RiboLock RNase Inhibitor (Thermo Fisher Scientific), and 1 µl Maxima RT enzyme (Thermo Fisher Scientific) were added. cDNA synthesis was performed at 50 °C for 30 min, and the reaction was terminated at 85 °C for 5 min. The cDNA products were purified using MinElute PCR Purification Kit (Qiagen) and eluted in 20 µl EB buffer.

Purified cDNA was PCR amplified with 1 µl 5′ forward leader region primer mix (Supplementary Table 2), 0.5 µl 100 µM Read2U primer containing the UMI sequence, 10 µl KAPA HiFi HotStart ReadyMix (Roche Molecular Systems), and 4.5 µl nuclease-free water at the following conditions: 96 °C—5 min; 25 cycles of 95 °C—20 s, 68 °C—20 s, 72 °C—20 s; 72 °C—5 min; 4 °C—hold. The PCR product bands were resolved in 1.2% agarose gel in TBE buffer, and the region of interest (around 480 bp) was excised and purified using QIAquick Gel Extraction Kit (Qiagen), then eluted in 20 µl EB buffer.

To create Illumina-compatible sequencing libraries, 10 µl purified 5’ MTPX PCR products were mixed with 0.5 µl 100 µM P5_R1 forward primer, 0.5 µl 100 µM P7_R2 reverse primer (Supplementary Table 3) containing Illumina index sequence, 12 µl KAPA HiFi HotStart ReadyMix, and nuclease-free water to a final volume of 24 µl, and PCR reaction was performed as follows: 96 °C—5 min; 10 cycles of 95 °C—30 s, 68 °C—30 s, 72 °C—30 s; 72 °C—10 min; 4 °C—hold. Library products were purified with AMPure XP beads (Beckman Coulter) at a 1:1 beads ratio. Molarity was determined using Qubit™ 4 Fluorometer, and the quality of the final libraries was inspected with BioAnalyzer High Sensitivity DNA chip (average expected product length 550 bp) and sequenced on the Illumina MiSeq platform (V3 chemistry 300 × 2 bp).

(Dataset 1) Single-cell BCR library preparation

B cells isolated directly from peripheral blood as described previously were loaded onto Chromium Next GEM Chip K (10x Genomics) with the expected output of 10,000 cells per well according to the manufacturer’s recommendations. The cells were then partitioned into Gel Beads-in-emulsion (GEMs) using the Chromium Controller instrument (10x Genomics). The subsequent preparation of single-cell RNA libraries was performed according to the Chromium Next GEM Single Cell 5’ Reagent Kits v2 (Dual Index) user guide (10x Genomics). The libraries were sequenced on the Illumina NovaSeq platform with the sequencing parameters recommended by 10x Genomics (minimum 5000 read pairs per cell; Read 1: 26 cycles, i7 Index: 10 cycles, i5 Index: 10 cycles, Read 2: 90 cycles).

(Dataset 2) Sample collection and initial processing

Samples were obtained from the UK NHS Blood and Transplant service in the form of 10 ml blood cones. Peripheral blood mononuclear cells (PBMCs) were isolated using LeucoSEP tubes (Greiner Bio-One) and resuspended in 50 ml PBS containing 2% FBS. PBMCs were counted using Propidium Iodide/Acridine Orange staining on the Cellometer Auto 2000 Automatic Cell Viability Counter System. \(1\times 10^{8}\) PBMCs were taken for B cell enrichment, and processed immediately. B cells were isolated from PBMCs through two rounds of magnetic enrichment using the human pan B cell isolation kit (Miltenyi Biotec) according to the manufacturer’s protocol. CD19+ B cell purity after enrichment was confirmed to be at least 95% through flow cytometry with a staining panel of anti-CD45, anti-CD3, anti-CD19, anti-CD38, and anti-CD20 antibodies and 7-AAD viability dye (Supplementary Table 4).

(Dataset 2) Bulk BCR sequencing

Aliquots of ~500,000 isolated B cells were pelleted at 400 g for 5 min and then resuspended in 350 µL RLT Plus buffer (Qiagen). Total RNA was then isolated using RNeasy Mini Plus kit (Qiagen) and eluted in 30 µL of nuclease-free water.

First-strand cDNA was generated from 22 µL total RNA using SuperScript RT IV (Invitrogen) and IgA, IgG, IgM, and IgD isotype-specific primers62 including UMIs at 55 °C for 50 min (inactivation at 80 °C for 10 min). The cDNA underwent a 0.8x bead clean-up (Beckman Coulter) and was eluted in 50 µL nuclease-free water. The resulting cDNA was used as template for High Fidelity PCR amplification (KAPA, Roche) using a set of 6 FR1-specific forward primers62 including sample-specific barcode sequences (6 bp) and a reverse primer specific to the RT primer (initial denaturation at 95 °C for 3 min, 25 cycles at 98 °C for 20 s, 60 °C for 30 s, 72 °C for 1 min and final extension at 72 °C for 7 min).

To purify the BCR heavy chain amplicons (~600 bp) a 0.5–0.8x double-sided bead clean-up was performed (Beckman Coulter) before quantification by Qubit (Invitrogen) and quality assessment by TapeStation (Agilent). Dual-indexed sequencing adapters (KAPA) were ligated onto 500-ng amplicons per patient using the HyperPrep library construction kit (KAPA) and the adapter-ligated libraries were finally PCR-amplified for 5 cycles (98 °C for 15 s, 60 °C for 30 s, 72 °C for 30 s, final extension at 72 °C for 1 min). Final libraries were quantified by Qubit (Invitrogen) and the quality was assessed by TapeStation (Agilent). Libraries were pooled in an equimolar ratio and sequenced on a single Illumina NextSeq P1 flow cell using the 2 × 300 bp chemistry.

(Dataset 2) Single-cell paired BCR sequencing

Cell hashing was performed to enable multiplet removal89. Each B cell sample was split into two aliquots of at least 500,000 cells for labeling with different TotalSeqC anti-human cell hashing antibodies (BioLegend). Cells at a concentration of 5 million/ml were first incubated with human Fc block (BD Biosciences) on ice for 10 min, with 5 µl added per 500,000 cells. Then, 1 µl cell hashing antibody was added per 500,000 cells and incubated on ice for 20 min. Labeled cells were then washed 3 times with 5 ml D-PBS 2% FBS followed by centrifugation (5 min; 400 g; 4 °C) and supernatant removal. Washed cell pellets were resuspended in D-PBS 2% FBS at a concentration of 1.5 million/ml and equal volumes of cells labeled with two different hashtags were combined.

50,000 labeled cells were loaded into each well of a 10x Genomics chip K and processed with the Chromium controller. BCR VDJ and feature barcode libraries were generated using the 10x Genomics immune profiling kit v2 according to the manufacturer’s instructions. Library quality was assessed using the TapeStation (Agilent). Paired-end sequencing with 2 × 150 cycles was performed on a NovaSeq 6000 S4 flow cell by Novogene (Cambridge, UK), with a minimum of 5000 reads per cell for both library types.

(Dataset 3) Public data for bulkBCR-seq and scBCR-seq on pediatric patient samples

Sequencing data was generated by King and colleagues30 and submitted to ArrayExpress (accession numbers: E-MTAB-8999, E-MTAB-9003). Briefly, tonsil samples from pediatric patients were used for B-cell isolation; cells were stained for surface markers distinctive for B-cell subsets and sorted by flow cytometry, followed by bulk and single-cell sequencing library preparation. The resulting unprocessed sequencing fastq files were utilized in this study.

Sequencing read annotation, error correction, and clonotyping

The generated BCR sequencing data was processed by MiXCR version 4.1.090 with UMI correction, using the built-in MiXCR human reference library. Briefly, UMI-corrected reads were assembled with a minimum of one read per consensus group to keep all sequencing reads, and clonotypes were assembled based on the nucleotide sequence of the CDR3 region + V gene name + J gene name, separated by C genes. 10X scBCR-seq data was also processed using MiXCR, using the preset 10x-vdj-bcr, with additional steps for partial assembly of CDR3 sequences, and assembling the longest possible contigs. All BCR sequencing data from Dataset 1, Dataset 2, and Dataset 3 were processed similarly.

For isotype-specific bulkBCR-seq data, the resulting clonotype tables were filtered to keep only clonotypes of the correct isotype and to remove clonotypes with out-of-frame sequences or stop codons. Clonal count was defined as the number of unique UMIs in a clonotype. BCR clonotypes identified as IgE in Dataset 1 bulkBCR-seq data were removed to ensure uniformity in comparison with the accompanying scBCR-seq data, and with the BCR-seq data in Dataset 2 and Dataset 3.

For scBCR-seq data, the longest assembled cell contigs were filtered to keep only contigs from single-cell droplets containing exactly one heavy chain and one light chain (Supplementary Fig. 4). The clonal count was defined as the number of unique cell barcodes in a clonotype.
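The two clonal count definitions above can be illustrated with a minimal Python sketch. This is not the MiXCR implementation: the record fields, the helper name `clonal_counts`, and the toy reads are hypothetical, and the clonotype key (CDR3 nucleotide sequence, V gene, J gene, C gene) simply mirrors the clonotype definition given earlier in this section.

```python
from collections import defaultdict

def clonal_counts(records, molecule_field):
    """Count unique identifiers (UMIs or cell barcodes) per clonotype.

    The clonotype key is (CDR3 nucleotide sequence, V gene, J gene, C gene).
    All field names are illustrative assumptions, not MiXCR column names.
    """
    seen = defaultdict(set)
    for rec in records:
        key = (rec["cdr3_nt"], rec["v_gene"], rec["j_gene"], rec["c_gene"])
        seen[key].add(rec[molecule_field])
    return {key: len(ids) for key, ids in seen.items()}

# Toy bulkBCR-seq reads: the first clonotype is supported by two distinct UMIs
# (one UMI is observed twice and is counted once).
bulk_reads = [
    {"cdr3_nt": "TGTGCGAGA", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "c_gene": "IGHM", "umi": "AAAC"},
    {"cdr3_nt": "TGTGCGAGA", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "c_gene": "IGHM", "umi": "AAAC"},
    {"cdr3_nt": "TGTGCGAGA", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "c_gene": "IGHM", "umi": "GGTT"},
    {"cdr3_nt": "TGTGCGAAA", "v_gene": "IGHV3-23", "j_gene": "IGHJ6", "c_gene": "IGHG1", "umi": "CCAT"},
]

# Toy scBCR-seq contigs: the same clonotype is observed in three distinct cells.
sc_contigs = [
    {"cdr3_nt": "TGTGCGAGA", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "c_gene": "IGHM", "barcode": "cell_1"},
    {"cdr3_nt": "TGTGCGAGA", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "c_gene": "IGHM", "barcode": "cell_2"},
    {"cdr3_nt": "TGTGCGAGA", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "c_gene": "IGHM", "barcode": "cell_3"},
]

bulk_counts = clonal_counts(bulk_reads, "umi")        # unique UMIs per clonotype
sc_counts = clonal_counts(sc_contigs, "barcode")      # unique cells per clonotype
```

The same grouping serves both data types; only the identifier that is counted changes, which is why the two clonal counts are not directly comparable.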

In Dataset 1, in order to examine the effect of sequencing depth on repertoire features, technical replicates from scBCR-seq samples were cumulatively merged at the raw data level (fastq files) before undergoing pre-processing. Samples were randomly selected and merged, ranging from two to eight technical replicates, and then the merged data underwent the same preprocessing steps.

The MiXCR output file of each sample provided information on identified features of the V region, including the name and family of the V, D, and J genes, and the identified isotype (C region). In addition, the nucleotide and amino acid sequence of each feature, including the CDRs and FRs, were also provided. Clonotype data from MiXCR were analyzed using the R package immunarch91, which performed the main analyses on immune repertoire data, including clonotype abundance, VH-gene usage, clonal overlap between repertoires, and clonal expansion.

To ensure comparability between bulkBCR-seq and scBCR-seq data, we did not take into account the size of each clonotype when quantifying repertoire features, since the method of calculating clonotype size is different for bulkBCR-seq and scBCR-seq. In addition, the calculation of clonal size in bulkBCR-seq based on UMIs is biased towards cells with higher mRNA expression, such as plasma cells, and does not reflect the biological abundance of the B cells, as opposed to clonal size calculation based on cell barcodes.

Quantification of repertoire features

VH-gene usage analysis

The identified V gene of each clonotype in a repertoire was counted, and the frequencies of all VH genes in a repertoire comprised the VH-gene usage profile. The VH gene of each clonotype has the same weight when calculating VH-gene usage frequencies, i.e., the counting of VH genes did not take into account the frequency of the corresponding clonotypes in a repertoire, in order to minimize the effect of differences in sequencing depth and library preparation methods between bulkBCR-seq and scBCR-seq.
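As an illustration of this unweighted counting, the following Python sketch computes a VH-gene usage profile in which every clonotype contributes one count regardless of its clonal size. The analysis in this study was performed with immunarch in R; the function name `vh_gene_usage` and the toy clonotype table are assumptions made for the example.

```python
from collections import Counter

def vh_gene_usage(clonotypes):
    """Unweighted VH-gene usage: each clonotype counts once, whatever its size."""
    counts = Counter(c["v_gene"] for c in clonotypes)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Toy repertoire: the clonal_count column is deliberately ignored.
clonotypes = [
    {"v_gene": "IGHV1-69", "clonal_count": 120},
    {"v_gene": "IGHV3-23", "clonal_count": 3},
    {"v_gene": "IGHV3-23", "clonal_count": 1},
    {"v_gene": "IGHV4-34", "clonal_count": 1},
]

usage = vh_gene_usage(clonotypes)
```

In this toy case IGHV3-23 accounts for half of the profile because it is used by two of the four clonotypes, even though the single IGHV1-69 clonotype is by far the most expanded.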

CDR3 sequence overlap

Pairwise CDR3 sequence convergence was quantified based on Jaccard’s overlap index, with values ranging from 0 (no overlap) to 1 (complete overlap) calculated using the following formula: \(J(A,B)=\frac{\left|A\cap B\right|}{\left|A\cup B\right|}\), where A and B represent the amino acid CDR3 sequences of two repertoires.
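A minimal Python sketch of the Jaccard index on sets of CDR3 amino acid sequences follows. The study used `repOverlap()` from immunarch; the function name `jaccard_overlap`, the handling of two empty repertoires, and the toy sequences are assumptions for illustration.

```python
def jaccard_overlap(cdr3_a, cdr3_b):
    """Jaccard index |A ∩ B| / |A ∪ B| on unique CDR3 amino acid sequences."""
    a, b = set(cdr3_a), set(cdr3_b)
    union = a | b
    if not union:
        # Two empty repertoires: defined as no overlap here (an assumption).
        return 0.0
    return len(a & b) / len(union)

# Toy repertoires sharing one CDR3 sequence out of four unique sequences.
repertoire_a = ["CARDYW", "CARGGW", "CAKDLW"]
repertoire_b = ["CARDYW", "CTRSSW"]

overlap = jaccard_overlap(repertoire_a, repertoire_b)
```

Because sequences are reduced to sets, duplicated CDR3 sequences and clonal abundance do not influence the index, in line with the abundance-independent metrics used throughout this study.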

Repertoire evenness profile

The degree of clonal expansion of each repertoire was calculated using Hill diversity profiles as described previously92. Briefly, the Hill diversity values were calculated for a range of α values from 0 to 10 following the formula: \(D(f)={\left(\sum_{i=1}^{n}{f}_{i}^{\alpha}\right)}^{\frac{1}{1-\alpha}}\), where \(f\) is the clonal frequency distribution, \(f_i\) is the clonal frequency of each clonotype, and \(n\) is the total number of clonotypes. Then, the repertoire evenness profile was obtained by normalizing the Hill diversity values by the number of clonotypes in each repertoire. Evenness values ranged from 0 to 1, with higher values indicating a more uniform distribution of clonal frequency in the repertoire.
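The evenness profile can be sketched in Python as follows. This is an illustrative reimplementation, not the code used in the study. The formula is undefined at α = 1, so the sketch assumes the standard convention of using the limit there (the exponential of the Shannon entropy); the function names and toy counts are hypothetical.

```python
import math

def hill_diversity(freqs, alpha):
    """Hill diversity of a frequency distribution for a given alpha."""
    if abs(alpha - 1.0) < 1e-9:
        # Limit for alpha -> 1: exponential of the Shannon entropy (assumed convention).
        return math.exp(-sum(f * math.log(f) for f in freqs if f > 0))
    return sum(f ** alpha for f in freqs if f > 0) ** (1.0 / (1.0 - alpha))

def evenness_profile(counts, alphas):
    """Hill diversity normalized by the number of clonotypes, for each alpha."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    n = len(freqs)
    return [hill_diversity(freqs, a) / n for a in alphas]

alphas = [0, 1, 2, 10]
uniform = evenness_profile([5, 5, 5, 5], alphas)    # all clonotypes equally frequent
skewed = evenness_profile([97, 1, 1, 1], alphas)    # one strongly expanded clonotype
```

A perfectly uniform repertoire gives an evenness of 1 at every α, whereas a repertoire dominated by one clone drops well below 1 as soon as α exceeds 0.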

Light chain coherence

For scBCR-seq samples, due to the paired chain characteristic of the data, the light chain coherence value (as a percentage) can be calculated, as described by Jaffe and colleagues55. In brief, light chain coherence is defined as the probability of B cell pairs having the same light chain V gene, given that they had the same heavy chain V gene and similar CDRH3 amino acid sequences. There are two measures: light chain coherence within individuals (the percentage of cell pairs from the same donor having the same light chain V-gene divided by all cell pairs having the same heavy chain V-gene name and similar CDRH3 amino acid sequences, but different computed clonotypes) and light chain coherence across individuals (the percentage of cell pairs from two different donors having the same light chain V-gene divided by all cell pairs having the same heavy chain V-gene name and similar CDRH3 amino acid sequences). In this work, we quantified light chain coherence only when a pair of cells contained CDRH3 sequences with 100% amino acid sequence identity.
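A minimal Python sketch of this calculation, under the 100% CDRH3 identity criterion used here, is shown below. It is a simplified reading of the definition rather than the original implementation by Jaffe and colleagues; the field names, the function name `light_chain_coherence`, and the toy cells are hypothetical.

```python
from itertools import combinations

def light_chain_coherence(cells, across_donors):
    """Percentage of eligible cell pairs that share the light chain V gene.

    Eligible pairs share the heavy chain V gene and an identical CDRH3 amino
    acid sequence. Within donors, pairs must also come from different computed
    clonotypes; across donors, pairs must come from different donors.
    """
    eligible = 0
    same_light = 0
    for a, b in combinations(cells, 2):
        if a["vh"] != b["vh"] or a["cdrh3"] != b["cdrh3"]:
            continue
        if across_donors:
            if a["donor"] == b["donor"]:
                continue
        elif a["donor"] != b["donor"] or a["clonotype"] == b["clonotype"]:
            continue
        eligible += 1
        if a["vl"] == b["vl"]:
            same_light += 1
    return 100.0 * same_light / eligible if eligible else float("nan")

# Toy cells: three from donor D1 (distinct clonotypes) and one from donor D2.
cells = [
    {"donor": "D1", "clonotype": 1, "vh": "IGHV3-23", "cdrh3": "CARDYW", "vl": "IGKV1-39"},
    {"donor": "D1", "clonotype": 2, "vh": "IGHV3-23", "cdrh3": "CARDYW", "vl": "IGKV1-39"},
    {"donor": "D1", "clonotype": 3, "vh": "IGHV3-23", "cdrh3": "CARDYW", "vl": "IGLV2-14"},
    {"donor": "D2", "clonotype": 4, "vh": "IGHV3-23", "cdrh3": "CARDYW", "vl": "IGKV1-39"},
]

within = light_chain_coherence(cells, across_donors=False)
across = light_chain_coherence(cells, across_donors=True)
```

In the toy data, one of the three eligible within-donor pairs and two of the three eligible cross-donor pairs share a light chain V gene.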

Antibody peptide sequencing by liquid chromatography-tandem mass spectrometry (Ab-seq)

Serum antibody isolation and protease digestion

Whole blood was collected from a healthy donor and centrifuged to separate plasma. Antibodies were purified from the plasma using the NAb™ Protein A/G Spin Kit for IgG, the Pierce™ IgM Purification Kit for IgM, Pierce™ Jacalin Agarose for IgA, and the NAb™ Protein L Spin Kit (Pierce™ Affinity Purification Kits) for all kappa chain antibodies (all from Thermo Scientific). Purified intact antibodies (40 μg IgG, 1.2 μg IgA, 1 μg IgM, and 10 μg κ chain Ig) were enzymatically digested prior to chromatographic separation. Four digestion strategies were employed: digestion with Trypsin (Tryp) only (Trypsin Gold, Mass Spectrometry Grade, Promega Cat 5280), Chymotrypsin (Ct) only (Chymotrypsin MS grade, Thermo Fisher Cat 90056), Chymotrypsin digestion followed by additional Trypsin digestion (Ct + Tryp), and AspN (AspN, Sequencing Grade, Promega Cat V1621).

Liquid chromatography and tandem mass spectrometry

All mass spectrometry experiments were performed by The Proteomics Core Facility at Oslo University Hospital on an EVOSEP liquid chromatography system connected to a quadrupole – Orbitrap (QExactive HF) mass spectrometer (ThermoElectron, Bremen, Germany) equipped with a nanoelectrospray ion source (EasySpray/Thermo). For liquid chromatography separation an 8 cm C18 column was used (Column details: Dr Maisch C18 AQ, 3 μm beads, 100 μm ID, 8 cm long EVOSEP). The standard EVOSEP throughput of 100 samples/day was used.

The mass spectrometer was operated in data-dependent mode to automatically switch between MS and MS/MS acquisition. Survey full scan MS spectra (from m/z 375 to 1500) were acquired in the Orbitrap with resolution R = 60,000 at m/z 200 (after accumulation to a target of 3,000,000 ions in the quadrupole). The method used allowed for sequential isolation of the most intense multiply charged ions, up to twelve, depending on signal intensity, for fragmentation in the higher-energy collisional dissociation (HCD) cell at a target value of 100,000 charges or a maximum acquisition time of 50 ms. MS/MS scans were collected at a resolution of 30,000 in the Orbitrap. Target ions already selected for MS/MS were dynamically excluded for 30 s. General mass spectrometry conditions were: electrospray voltage, 2.0 kV; no sheath or auxiliary gas flow; heated capillary temperature, 250 °C; and normalized HCD collision energy, 28%.

Mass spectrometry data processing and analysis

Proteomics data from LC-MS/MS were processed by MaxQuant version 2.4.493 with default parameters (up to 2 missed cleavages allowed, minimal detected peptide length 7 aa, false discovery rate 1%). Data from identified peptide sequences and their respective matching databases were further processed in R.

In addition to the BCR sequences in Dataset 1 as references, the experimental data were searched against 12 decoy databases:

  • UniProt: uniprot-proteome_UP000005640_march_2021 (1 database)94.

  • IGoR igh: 10,000 randomly generated naive sequences using the IGoR software suite (1 database)95.

  • IMGT genes: igh (v, d, j genes), igk (v, j genes), igl (v, j genes) (3 + 2 + 2 databases)96.

  • ImmuneSIM igh, igk, igl: 10,000 naive sequences randomly generated with immuneSIM (3 databases)97.

Peptides reported by MaxQuant output were subjected to different stages of filtering. First, only peptides that did not map to contaminant proteins were kept (all peptides). Next, only peptides that mapped to a reference BCR sequence were retained (antibody-specific peptides), excluding peptides that mapped to decoy databases. Subsequently, the peptides were overlapped with their reference CDR3 sequences, and only those with an overlap length of three amino acids or greater were retained (CDR3-overlapping peptides). Finally, amongst CDR3-overlapping peptides, those mapped to a single BCR clonotype (uniquely mapped CDR3-overlapping peptides) were used for clonotype identification and V(D)J sequence reconstruction. Additionally, for Ab-seq peptides that mapped to a scBCR-seq reference clonotype, the cell barcodes associated with the clonotype were utilized to recover the paired chain sequence.
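The last two filtering stages (CDR3 overlap of at least three amino acids, then unique mapping to one clonotype) can be sketched in Python as below. This is a simplified illustration, not the R code used in the study: peptides are matched by exact substring search, only the first match per reference is considered, and the field names, function names, and toy sequences are hypothetical.

```python
def cdr3_overlap_length(peptide, ref):
    """Number of residues shared by the peptide match and the CDR3 interval.

    `cdr3_start` and `cdr3_end` are 0-based, end-exclusive positions in the
    reference sequence. Only the first exact match is used (a simplification).
    """
    pos = ref["sequence"].find(peptide)
    if pos < 0:
        return 0
    end = pos + len(peptide)
    return max(0, min(end, ref["cdr3_end"]) - max(pos, ref["cdr3_start"]))

def uniquely_mapped_cdr3_peptides(peptides, references, min_overlap=3):
    """Keep peptides that map to exactly one clonotype and overlap its CDR3."""
    kept = {}
    for pep in peptides:
        hits = [ref for ref in references if pep in ref["sequence"]]
        clonotype_ids = {ref["clonotype_id"] for ref in hits}
        if len(clonotype_ids) != 1:
            continue  # unmapped or ambiguous peptide
        if max(cdr3_overlap_length(pep, ref) for ref in hits) >= min_overlap:
            kept[pep] = clonotype_ids.pop()
    return kept

# Toy reference clonotypes; the CDR3 spans positions 4 to 13 in both sequences.
references = [
    {"clonotype_id": "clone1", "sequence": "AVYYCARDYGMDVWGQGTTVTVSS", "cdr3_start": 4, "cdr3_end": 14},
    {"clonotype_id": "clone2", "sequence": "AVYYCAKGSWFDPWGQGTLVTVSS", "cdr3_start": 4, "cdr3_end": 14},
]
peptides = ["YYCARDYGMDVW", "AVYYC", "WGQGTTVTVSS"]

identified = uniquely_mapped_cdr3_peptides(peptides, references)
```

Here only the first peptide survives: the second maps to both clonotypes, and the third maps uniquely but overlaps the CDR3 by a single residue.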

Repertoire feature quantification and statistical analysis

Data analysis was performed in the R programming environment (v4.2.3)98. Repertoire feature quantification was performed with the R package immunarch91, with repExplore() for clonal quantification, geneUsage() and geneUsageAnalysis() for quantifying VH-gene usage frequency and pairwise correlation of VH-gene usage frequencies, respectively, and repOverlap() for quantifying CDR3 sequence overlap. CDR3 pairwise edit distance calculations were performed with the package stringdist99.
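For readers unfamiliar with edit distances, the pairwise calculation can be sketched in Python as follows. The specific distance method used with stringdist is not stated here, so the sketch assumes the plain Levenshtein distance (insertions, deletions, substitutions); the function names and toy CDR3 sequences are illustrative.

```python
from itertools import combinations

def levenshtein(a, b):
    """Levenshtein edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def pairwise_edit_distances(sequences):
    """Edit distance for every unordered pair of sequences."""
    return {(s, t): levenshtein(s, t) for s, t in combinations(sequences, 2)}

cdr3s = ["CARDYW", "CAKDYW", "CARDYGW"]
distances = pairwise_edit_distances(cdr3s)
```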

The rstatix100 package was used for statistical tests. For repertoire features measured pairwise (VH-gene usage correlation, CDRH3 overlap), each data point is a pair of sequencing samples, grouped into different categories. Global differences between groups were tested with Kruskal-Wallis test, and post-hoc analysis was performed with the Wilcoxon Rank Sum test, with p values adjusted for multiple testing by Bonferroni correction. Adjusted p values lower than 0.05 were deemed significant.
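The Bonferroni adjustment amounts to multiplying each raw p value by the number of comparisons and capping the result at 1. A minimal Python sketch follows; the study itself used rstatix in R, and the function name and toy p values here are purely illustrative.

```python
def bonferroni_adjust(p_values):
    """Bonferroni correction: p * number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy raw p values from three hypothetical post-hoc pairwise comparisons.
raw_p = [0.01, 0.02, 0.5]
adjusted_p = bonferroni_adjust(raw_p)

# Significance call at the 0.05 threshold used in this study.
significant = [p < 0.05 for p in adjusted_p]
```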

Graphics generation

All heatmaps were generated using pheatmap101. Figures 1 and 2 were created with BioRender (Biorender.com), all other figures were created using ggplot2102 and arranged in Adobe Illustrator (adobe.com/products/illustrator).