A dataset comprised of binding interactions for 104,972 antibodies against a SARS-CoV-2 peptide

Engelhart, Emily; Emerson, Ryan; Shing, Leslie; Lennartz, Chelsea; Guion, Daniel; Kelley, Mary; Lin, Charles; Lopez, Randolph; Younger, David; Walsh, Matthew E.

doi:10.1038/s41597-022-01779-4

A dataset comprised of binding interactions for 104,972 antibodies against a SARS-CoV-2 peptide

Data Descriptor
Open access
Published: 26 October 2022

Volume 9, article number 653, (2022)
Cite this article

Download PDF

You have full access to this open access article

Scientific Data

A dataset comprised of binding interactions for 104,972 antibodies against a SARS-CoV-2 peptide

Download PDF

6856 Accesses
7 Citations
1 Altmetric
Explore all metrics

Abstract

The dataset presented here contains quantitative binding scores of scFv-format antibodies against a SARS-CoV-2 target peptide collected via an AlphaSeq assay that can be used in the development and benchmarking of machine learning models. Starting from three seed sequences identified from a phage display campaign using a human naïve library, four sets of 29,900 antibodies were designed in silico by creating all k = 1 mutations and random k = 2 and k = 3 mutations throughout the complementary-determining regions (CDRs). Of the 119,600 designs, 104,972 were successfully built in to the AlphaSeq library and target binding was subsequently measured with 71,384 designs resulting in a predicted affinity value for at least one of the triplicate measurements. Data include antibodies with predicted affinity measurements ranging from 37 pM to 22 mM. To our knowledge, this dataset is the largest, publicly available dataset that contains antibody sequences, antigen sequence and quantitative measurements of binding scores and provides an opportunity to serve as a benchmark to evaluate antibody-specific representation models for machine learning.

Measurement(s)	Antibody Binding
Technology Type(s)	AlphaSeq
Factor Type(s)	Antibody sequence
Sample Characteristic - Organism	Homo sapiens

The rapid and highly parallel identification of antibodies with defined biological activities by SLISY

Article Open access 03 January 2023

Machine learning optimization of candidate antibody yields highly diverse sub-nanomolar affinity antibody libraries

Article Open access 12 June 2023

De novo design of antibody complementarity determining regions binding a FLAG tetra-peptide

Article Open access 31 August 2017

Background & Summary

Protein modelling is an area of machine learning research that has attractive potential benefits to protein engineering. Given the magnificent size of the design space for a given protein, it is not feasible to measure phenotypic properties of all possible designs empirically. Machine learning methods can help constrain the design space and serve as a basis for recommending designs to test in the lab to save time and reduce cost¹. Such approaches have been taken to engineer enzymes², fluorescent proteins³, and antibodies⁴ among others^5,6. Successful examples of machine learning-enabled protein engineering have relied on access to large, labelled datasets of protein sequences that are typically generated as part of high throughput experimental campaigns. When datasets and models have been made publicly available, the entire field has benefited from ensuing comparisons and benchmarks^1,7.

Although these approaches have been demonstrated for the engineering of antibodies, there remains a scarcity of labelled data available in the public domain to advance this area of research with respect to antibody binding. Large-scale curation efforts have resulted in databases of well over a billion antibody sequences without target or binding affinity values⁸. Other efforts have resulted in datasets with close to one thousand antibodies with labels – either target sequences⁹ or neutralization values¹⁰. Additional work in antibody binding prediction reports a subset of the data generated and used⁴. Outside of antibody binding, one group has published multiple manufacturability measurements for over 100 antibodies that have completed or advanced through the FDA approval pipeline^11,12,13.

This lack of labelled data may be the result of differences in data requirements for training machine learning models compared to finding a design with the target phenotype in the absence of machine learning. Phage display, for example, is intended to provide the researcher with information on a small number of the top binders from a pool of >10⁶ designs^14,15. This type of data is unsuitable for training models because poor binding sequences are unknown and no quantitative binding measurements are generated. Methods that do provide quantitative measurements, such as enzyme-linked immunosorbent assays¹⁶ and surface plasmon resonance¹⁷, rely on isolation of individual antibodies and therefore are significantly lower throughput and more expensive.

Recent methods using engineered yeast expression systems and next generation DNA sequencing overcome some of these challenges and can generate large scale datasets of quantitative protein-protein binding interactions including antibody-antigen interactions^18,19. We utilized such a technique, referred to as AlphaSeq, to generate the dataset described here. Importantly, we first conducted a phage display experiment to identify three candidates that bind to our target, a conserved peptide in coronaviruses. Those candidates were the seed sequences and in addition to designing all single mutants, we performed in silico randomization to introduce two or three random mutations in the complementary determining regions (CDRs) for 119,600 designs. As a result, we generated a dataset that covers a considerable breadth of sequence space and has a wide range of binding measurements that is suitable for training machine learning algorithms.

Methods

Figure 1 provides an overview of the experimental workflow.

Target selection

Antibodies were targeted against a peptide in the HR2 region of the SARS-CoV-2 spike protein to which neutralizing antibodies have been observed²⁰. Additionally, this sequence is reported to have low variability across coronaviruses and could maintain therapeutic value against viral variants²¹. The exact amino acid sequence targeted was PDVDLGDISGINAS.

Phage display

A phage display panning experiment was performed by GenScript USA Inc. to identify candidate binders. A biotinylated target peptide, LCBiot-PDVDLGDISGINAS-OH, (vivitide, LLC) was provided to GenScript USA Inc. The human naïve phage library used by Genscript USA Inc. is marketed to be derived from 300 healthy human donors, has a size of 1.1 × 10¹⁰ and is in Fab format.

AlphaSeq antibody screening

A total of five antibody sequences in scFv format were evaluated in a proof-of-concept AlphaSeq experiment; three of those five sequences bound to the target and were carried forward. All five antibody sequences were tested in both heavy-light (HL) and light-heavy (LH) chain orientation. In general, chain orientation had little to no effect on binding affinity with all but Ab-91-HL (3.39 nM) resulting in predicted K_D values below 1 nM. The best chain orientation was selected for each antibody; HL was selected for Ab-14, LH was selected for Ab-91 and HL was selected for Ab-95.

In silico design of antibody libraries

Two heavy chains and two light chains were arbitrarily selected from the three antibody seed sequences listed in Table 1 for the design of the antibody library: Ab-14-VH, Ab-91-VH, and Ab-14-VL, Ab-95-VL, respectively. The goal of the in silico design process was to generate 29,900 sequence variants for each chain out of the 120,000 total sequence budget, leaving 400 sequences for controls allocated for the binding experiment. K-point mutations, where k = 1, 2, and 3, were produced for the CDRs of each chain. Point mutations were limited to amino acid substitutions; indels were avoided to ensure the amino acid sequence was constant length. Up to k = 3 mutations were chosen to guarantee there was at least one instance where one amino acid substitution occurred in all CDRs of a given chain at a given time. Martin Lab’s CDR rule set was applied to extract the CDRs from their approximate positions in each chain. The number of variants for k = 1 mutations was determined based on the combined number of amino acid positions in the CDRs of a given chain and the total number of possible amino acid substitutions in each position. All k = 1 variants were kept, ensuring that duplicates and original chain sequences were removed. Using the number of sequence variants for k = 1, a scaling factor of ~6 was applied to determine the number of variants to sample from the total number of k = 2 and k = 3 possible sequence variants, as shown in Table 2.

Table 1 Target and Antibody Seed Sequences. CDRs in bold.

Full size table

Table 2 Distribution and incorporation of mutations by library and k mutations. Additionally, there are seven seed sequences with no mutations.

Full size table

AlphaSeq data collection

Yeast media

Yeast peptone dextrose (YPAD), yeast peptone galactose (YPAG), and synthetic drop out (SDO) media supplemented with 80 mg/mL adenine were made according to standard protocols. Suppliers used for our yeast media are as follows: Bacto Yeast Extract (Life Technologies), Bacto Tryptone (Fisher BioReagents), Dextrose (Fisher Chemical), Galactose (Sigma-Aldrich), Adenine (ACROS Organics), Yeast Nitrogen Base w/o Amino Acids (Thermo Scientific), SC-His-Leu-Lys-Trp-Ura Powder (Sunrise Science Products), Yeast Synthetic Drop-out Medium Supplements (Sigma-Aldrich), L-Histidine (Fisher BioReagents), L-Tryptophan (Fisher BioReagents), L-Leucine (Fisher BioReagents), Uracil (ACROS Organics), and Bacto Agar (Fisher BioReagents).

Isogenic yeast transformation

AlphaSeq compatible plasmids encoding yeast surface display cassettes were constructed by Twist Bioscience and resuspended at 100 ng/µL. 100 ng of plasmid was digested with PmeI enzyme for 1 hr at 37 °C to linearize, leaving chromosomal homology for integration into the ARS314 locus at both the 5′ and 3′ ends as previously described¹⁸. Yeast transformations were performed with Frozen-EZ Yeast Transformation Kit II (Zymo Research) according to manufactures instructions. Yeast were plated on SDO-Trp plates and grown at 30 °C for 2–3 days. Successful transformants were struck out onto YPAD plates and grown overnight at 30 °C.

Protein expression validation – Flow cytometry

Yeast were inoculated in YPAD and grown overnight at 30 °C. Yeast were labelled with FITC-anti-C-myc antibody (Immunology Consultants Laboratory, Inc.) in PBS (Gibco) + 0.2% BSA (Thermo Fisher Scientific) for 30 minutes at RT. Yeast were pelleted and resuspended in PBS + 0.2% BSA and read on a LSRII cytometer.

DNA library construction

A 300 bp oligonucleotide pool synthesized by Twist Bioscience was resuspended at 20 ng/µL in molecular grade water. Libraries were PCR amplified from the oligonucleotide pool using KAPA DNA polymerase (Roche). The oligonucleotide amplification fragment was inserted into the seed scFv backbone using Gibson isothermal assembly (NEB), as well as a second DNA fragment containing a randomized DNA barcode. The assembled barcoded antibody DNA library was PCR amplified. Fragments were run on a 0.8% agarose gel and extracted using Monarch Gel Purification kit (NEB).

Yeast library transformation

MATa AlphaSeq yeast were grown for 6 hours in YPAG media to induce SceI expression, as described previously¹⁸. All spin steps were performed at 3000 RPM for 5 minutes. Yeast were spun down and washed once in 50 mL 1 M Sorbitol (Teknova) + 1 mM CaCl₂ solution. Washed yeast were resuspended in a solution of 0.1 M LiOAc/1 mM DTT and incubated shaking at 30 °C for 30 minutes. After 30 minutes, yeast were spun down and washed once in 50 mL 1 M Sorbitol + 1 mM CaCl₂ solution. Yeast were resuspended to a final volume of 400 µL in 1 M Sorbitol + 1 mM CaCl₂ solution and incubated with DNA for at least 5 minutes on ice. Yeast were electroporated at 2.5 kV and 25 uF (BioRad). Immediately following electroporation, yeast were resuspended in 5 mL of 1:1 solution of 1 M Sorbitol:YPAD and incubated shaking at 30 °C for 30 minutes. Recovered yeast cells were spun down and resuspended in 50 mL of SDO-Trp media and transferred to a 250 mL baffled flask. 20 µL of resuspended cells were plated on SDO-Trp to determine transformation efficiency. Both the flask and plate were incubated at 30 °C for 2–3 days. After 2–3 days, transformation efficiency was determined by counting colonies on the SDO-Trp plate.

Nanopore barcode mapping

Genomic DNA from yeast libraries was extracted using Yeast DNA Extraction Kit (Thermo Fisher Scientific) following the manufacturer’s instructions. A single round of qPCR was performed to amplify a fragment pool from the genomic DNA containing the gene through the associated DNA barcode. qPCR was terminated before saturation to minimize PCR bias, generally between 15–20 cycles. The final amplified fragment was concentrated with KAPA beads, quantified with a Quantus (Promega), prepped with a SQK-LSK-110 ligation kit (Oxford Nanopore) and sequenced with a Minion R10 flow cell (Oxford Nanopore) following the manufacturer’s instructions. Each sequencing read was aligned to the set of expected antibody sequences from the in silico antibody library using BLASTN²² to determine the mapping between DNA barcodes and antibody sequence; only DNA barcodes with at least 2 reads observed were considered, and each DNA barcode was matched to the most common BLASTN antibody match among its constituent reads.

Library-on-library AlphaSeq assays

Two mL of saturated MATa and MATalpha library were combined in 800 mL of YPAD media and incubated at 30 °C in a shaking incubator. Three technical replicates were performed for each assay (Table 3). After 16 hr, 100 mL of yeast culture was washed once in 50 mL of sterile water and transferred to 600 mL of SDO-lys-leu with 100 nM ß-estradiol (Sigma) for 24 hr at 30 °C in a shaking incubator. After 24 hr, 100 mL of yeast was transferred to fresh SDO-lys-leu with 100 nM ß-estradiol for an additional 24 hr at 30 °C in a shaking incubator. In addition to the antibody libraries described above, control yeast strains comprising a small network of BCL2-family proteins as previously described¹⁸ were included in each experiment to act as a set of standards for which BLI-derived interaction affinities were known a priori.

Table 3 Composition of AlphaSeq assays.

Full size table

Library preparation for next-generation sequencing

Genomic DNA was extracted using Yeast DNA Extraction Kit (Thermo Fisher Scientific) following manufacturer’s instructions. qPCR was performed to amplify a fragment pool from the genomic DNA and to add standard Illumina sequencing adaptors and assay specific index barcodes. qPCR was terminated before saturation to minimize PCR bias, generally between 23–27 cycles. The final amplified fragment was concentrated with KAPA beads, quantified with a Quantus (Promega), and sequenced with a NextSeq 500 sequencer (Illumina).

AlphaSeq bioinformatics

Sequencing data were analyzed to identify the MATa and MATalpha barcode pairs present among diploid yeast. The observed number of sequencing reads for each MATa/MATalpha combination were normalized according to frequency among haploid yeast to account for uneven distribution of the input populations. Each aα pair was then assigned a score representing the ratio of observed sequencing reads to expected sequencing reads assuming random mating. A linear regression was performed comparing these normalized sequencing scores to known affinities for the control yeast strains and this regression was utilized to assign estimated affinities to all other aα pairs for each mating replicate.

Data Records

Data structure and repository

A single dataset was generated during this study. This data set contains the output of two AlphaSeq assay performed as part of a single study and is deposited at Zenodo²³ (https://doi.org/10.5281/zenodo.5095284). The dataset contains the variables listed in Table 4.

Table 4 Variables and associated descriptions.

Full size table

Data sets and file types

The data are stored in a single.csv file. All data can be downloaded from Zenodo²³.

Technical Validation

Library coverage

To ensure sufficient proportions of designed sequences were assembled into each library and to confirm no incorporation bias based on number of mutations, we evaluated the percentage incorporation per k mutations per library (Table 2). The intra-library variation is small, never exceeding 2%, indicating no bias due to number of mutations. Because each library is constructed separately, it is expected that the inter-library differences in incorporation will be greater than that of the intra-library range. This is the case with incorporation percentages ranging from 75.2% to 99.7%. Given these observations, we conclude the libraries are constructed sufficiently.

Reproducibility

To assess variation attributable to the AlphaSeq process, each yeast mating experiment was performed in triplicate, with separate determination of K_d values for each technical replicate. Figure 2a includes matrices with pairwise Pearson Correlation values for each pair of replicates within a given library. The Pearson Correlation ranges from 0.66 (AAYL52 Rep 1 vs Rep 2) to 0.93 (AAYL50 Rep 1 vs Rep 3). Figure 2b presents a visualization of the pairwise comparison of each replicate within AAYL49. The observed phenomena of better correlation at lower predicted affinity values holds true across each library. Affinity measurements, especially sub-micromolar affinities, are highly reproducible between AlphaSeq replicates.

Analysis of standards

Control yeast strains comprising a small network of BCL2-family proteins, as previously described^24,25 were included in each experiment to act as a set of standards for which bio-layer interferometry-derived interaction affinities were known a priori. Figure 3a shows the correspondence between known K_d values and AlphaSeq-predicted affinity values for these PPIs, with a computed linearity of R² = 0.85.

Analysis of binding affinity for 1/2/3-site variants

To further validate the assay results by identifying expected patterns, AlphaSeq-derived binding affinities were compared for all antibody sequences, binned by the number of mutations separating each antibody from its seed sequence (1, 2 or 3). Results are shown in Fig. 3b; as expected, median affinity decreases with each additional mutation (2.08 log₁₀ nM, 2.70 log₁₀ nM, 3.21 log₁₀ nM respectively for 1, 2, 3 mutations) while variance increases with each additional mutation (interquartile range 1.69 log₁₀ nM, 2.06 log₁₀ nM, 2.09 log₁₀ nM). In other words, each added mutation increases the probability of breaking the antibody but there is also more room for improvement over the wild type.

Usage Notes

Binding to negative controls

The inclusion of MATα yeast expressing no POI serves as an opportunity to identify MATa yeast with non-specific binding. These negative control yeast strains: AlphaNeg1, AlphaNeg2, and AlphaNeg3 are expressing AGA2 with an N-terminal HA epitope tag and C-terminal Myc tag without a POI. As such, many of the entries within the dataset represent interactions between MATa yeast with a MATα yeast expressing a negative target. The values associated with these measurements range from 1.03 log₁₀ nM to 7.14 log₁₀ nM for assay 1 and 1.35 log₁₀ nM to 7.32 log₁₀ nM for assay 2. As these values are higher than the distribution of values for on-target binding, they serve as an additional confirmation that the pred_affinity measurements are resultant of on-target binding. The binding affinities measured against negative controls represent some combination of nonspecific yeast mating and molecular artifacts introduced to the barcodes during PCR and sequencing and can act as an empirical readout of the limit of blank for this dataset. Note that given the increase in technical variation observed with increasing pred_affinity values, it is not recommended to background subtract these values from the on-target pred_affinity measurements.

Normalization among replicates and assays

Each assay and replicate contains each of the three seed sequences and can be used to normalize the data among the assays or replicates. These sequences can also be included in future assays to allow for integration of additional data. Additionally, the regression used to transform sequencing abundances to predicted affinity values is performed once for each replicate and then applied to the entire replicate; relative ranking of interactions within a replicate are insensitive to any technical variation in that calculation, but such error will propagate to all quantitative predicted affinity measurements in that replicate.

Sequences without a pred_aff value

Data entries in which a sequence and target pair is specified but does not have a pred_aff value indicate a poor binding interaction. These antibody sequences are observed in DNA sequencing of the MATa haploid yeast population, but not among diploid yeast, affirming the sequence is present in the MATa library but no mating was observed. There are multiple options for how to treat these entries in downstream applications, including removing them from the dataset. While not conclusive, absence of diploids is strong evidence of poor binding affinity; imputing affinity values to indicate as such may be advantageous. Values could be imputed, for example, as the maximum pred_aff value or as the median pred_aff value of sequences not having measurements in all replicates.

Code availability

Code associated with the randomization of antibody designs is available on GitHub (https://github.com/mit-ll/Insilico_Ab_Variant_Generator).

Code used for sequence analysis is functionally similar to code that has previously been released¹⁸. This code can be accessed on GitHub (https://github.com/dyounger/yeast_synthetic_agglutination).

References

Xu, Y. et al. Deep dive into machine learning models for protein engineering. J. Chem. Inf. Model. 60, 2773–2790 (2020).
Article CAS PubMed Google Scholar
Wu, Z., Kan, S. B. J., Lewis, R. D., Wittmann, B. J. & Arnold, F. H. Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl. Acad. Sci. 116, 8852–8858 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Saito, Y. et al. Machine-learning-guided mutagenesis for directed evolution of fluorescent proteins. ACS Synth. Biol. 7, 2014–2022 (2018).
Article CAS PubMed Google Scholar
Liu, G. et al. Antibody complementarity determining region design using high-capacity machine learning. Bioinformatics 36, 2126–2133 (2020).
Article CAS PubMed Google Scholar
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Article CAS PubMed Google Scholar
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Article CAS PubMed PubMed Central Google Scholar
Rao, R. et al. Evaluating protein transfer Learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
PubMed PubMed Central Google Scholar
Kovaltsuk, A. et al. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. J. Immunol. 201, 2502–2509 (2018).
Article CAS PubMed Google Scholar
Ferdous, S. & Martin, A. C. R. AbDb: antibody structure database—a database of PDB-derived antibody structures. Database https://doi.org/10.1093/database/bay040 (2018).
Article PubMed PubMed Central Google Scholar
Yoon, H. et al. CATNAP: a tool to compile, analyze and tally neutralizing antibody panels. Nucleic Acids Res. 43, W213–W219 (2015).
Article CAS PubMed PubMed Central Google Scholar
Lu, X. et al. Deamidation and isomerization liability analysis of 131 clinical-stage antibodies. mAbs 11, 45–57 (2019).
Article CAS PubMed Google Scholar
Yang, R. et al. Rapid assessment of oxidation via middle-down LCMS correlates with methionine side-chain solvent-accessible surface area for 121 clinical stage monoclonal antibodies. mAbs 9, 646–653 (2017).
Article CAS PubMed PubMed Central Google Scholar
Jain, T. et al. Biophysical properties of the clinical-stage antibody landscape. Proc. Natl. Acad. Sci. 114, 944–949 (2017).
Article ADS CAS PubMed PubMed Central Google Scholar
Clackson, T., Hoogenboom, H. R., Griffiths, A. D. & Winter, G. Making antibody fragments using phage display libraries. Nature 352, 624–628 (1991).
Article ADS CAS PubMed Google Scholar
Winter, G., Griffiths, A. D., Hawkins, R. E. & Hoogenboom, H. R. Making antibodies by phage display technology. Annu. Rev. Immunol. 12, 433–455 (1994).
Article CAS PubMed Google Scholar
Engvall, E. & Perlmann, P. Enzyme-linked immunosorbent assay (ELISA) quantitative assay of immunoglobulin G. Immunochemistry 8, 871–874 (1971).
Article CAS PubMed Google Scholar
Liedberg, B., Nylander, C. & Lunström, I. Surface plasmon resonance for gas detection and biosensing. Sens. Actuators 4, 299–304 (1983).
Article CAS Google Scholar
Younger, D., Berger, S., Baker, D. & Klavins, E. High-throughput characterization of protein–protein interactions by reprogramming yeast mating. Proc. Natl. Acad. Sci. 114, 12166–12171 (2017).
Article ADS CAS PubMed PubMed Central Google Scholar
Starr, T. N. et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell 182, 1295–1310.e20 (2020).
Article CAS PubMed PubMed Central Google Scholar
Lai, S.-C. et al. Characterization of neutralizing monoclonal antibodies recognizing a 15-residues epitope on the spike protein HR2 region of severe acute respiratory syndrome coronavirus (SARS-CoV). J. Biomed. Sci. 12, 711–727 (2005).
Article CAS PubMed Google Scholar
Ricke, D. O. Two different antibody-dependent enhancement (ADE) risks for SARS-CoV-2 antibodies. Front. Immunol. 12, 640093 (2021).
Article CAS PubMed PubMed Central Google Scholar
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Article CAS PubMed Google Scholar
Walsh, M. et al. mit-ll/AlphaSeq_Antibody_Dataset: Initial release of AlphaSeq antibody dataset. Zenodo https://doi.org/10.5281/zenodo.5095284 (2021).
Procko, E. et al. A computationally designed inhibitor of an Epstein-Barr viral Bcl-2 protein induces apoptosis in infected cells. Cell 157, 1644–1656 (2014).
Article CAS PubMed PubMed Central Google Scholar
Berger, S. et al. Computationally designed high specificity inhibitors delineate the roles of BCL2 family proteins in cancer. eLife 5, e20352 (2016).
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The authors acknowledge the contributions of Dr. Darrell O. Ricke in the selection of the target peptide sequence. Additionally, the authors appreciate the comments made on the manuscript by Dr. Alexander Titus. DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the United States Air Force under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force. © 2021 Massachusetts Institute of Technology. Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

Author information

Matthew E. Walsh
Present address: Department of Environmental Health and Engineering, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
These authors contributed equally: Emily Engelhart, Ryan Emerson.

Authors and Affiliations

A-Alpha Bio, Inc., Seattle, WA, USA
Emily Engelhart, Ryan Emerson, Daniel Guion, Mary Kelley, Charles Lin, Randolph Lopez & David Younger
Massachusetts Institute of Technology Lincoln Laboratory, Lexington, MA, USA
Leslie Shing, Chelsea Lennartz & Matthew E. Walsh

Authors

Emily Engelhart
View author publications
You can also search for this author in PubMed Google Scholar
Ryan Emerson
View author publications
You can also search for this author in PubMed Google Scholar
Leslie Shing
View author publications
You can also search for this author in PubMed Google Scholar
Chelsea Lennartz
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Guion
View author publications
You can also search for this author in PubMed Google Scholar
Mary Kelley
View author publications
You can also search for this author in PubMed Google Scholar
Charles Lin
View author publications
You can also search for this author in PubMed Google Scholar
Randolph Lopez
View author publications
You can also search for this author in PubMed Google Scholar
David Younger
View author publications
You can also search for this author in PubMed Google Scholar
Matthew E. Walsh
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

E.E. performed yeast strain construction and validation, yeast library construction and barcode mapping, and yeast mating assay and Illumina library preparation, performed data analysis and wrote the manuscript. R.E. performed data analysis and wrote the manuscript. L.S. performed the in silico randomization and wrote the manuscript. C. Lennartz performed data analysis and visualization. D.G. performed yeast mating assay and Illumina library preparation. M.K. performed yeast strain construction and validation and yeast library construction and barcode mapping. C. Lin performed yeast strain construction and validation. R.L. conceived the project and wrote the manuscript. D.Y. conceived the project and wrote the manuscript. M.E.W. conceived and supervised the project, selected the target peptide sequence, and wrote the manuscript.

Corresponding author

Correspondence to Matthew E. Walsh.

Ethics declarations

Competing interests

R.L. and D.Y. are the founders and current employees of A-Alpha Bio, Inc. (A-Alpha Bio) and own stock/stock options of A-Alpha Bio. E.E., R.E., C. Lin, M.K and D.G. are employees of A-Alpha Bio; all employees own stock/stock options of A-Alpha Bio. A-Alpha Bio has a patent (US10988759B2) relating to certain research described in this article. L.S., C. Lennartz, and M.E.W. declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Engelhart, E., Emerson, R., Shing, L. et al. A dataset comprised of binding interactions for 104,972 antibodies against a SARS-CoV-2 peptide. Sci Data 9, 653 (2022). https://doi.org/10.1038/s41597-022-01779-4

Download citation

Received: 25 November 2021
Accepted: 14 October 2022
Published: 26 October 2022
DOI: https://doi.org/10.1038/s41597-022-01779-4
Springer Nature Limited

This article is cited by

The rise of big data: deep sequencing-driven computational methods are transforming the landscape of synthetic antibody design
- Eugenio Gallo
Journal of Biomedical Science (2024)
Machine learning optimization of candidate antibody yields highly diverse sub-nanomolar affinity antibody libraries
- Lin Li
- Esther Gupta
- Matthew E. Walsh
Nature Communications (2023)

A dataset comprised of binding interactions for 104,972 antibodies against a SARS-CoV-2 peptide

Abstract

Similar content being viewed by others

The rapid and highly parallel identification of antibodies with defined biological activities by SLISY

Machine learning optimization of candidate antibody yields highly diverse sub-nanomolar affinity antibody libraries

De novo design of antibody complementarity determining regions binding a FLAG tetra-peptide

Background & Summary

Methods

Target selection

Phage display

AlphaSeq antibody screening

In silico design of antibody libraries

AlphaSeq data collection

Yeast media

Isogenic yeast transformation

Protein expression validation – Flow cytometry

DNA library construction

Yeast library transformation

Nanopore barcode mapping

Library-on-library AlphaSeq assays

Library preparation for next-generation sequencing

AlphaSeq bioinformatics

Data Records

Data structure and repository

Data sets and file types

Technical Validation

Library coverage

Reproducibility

Analysis of standards

Analysis of binding affinity for 1/2/3-site variants

Usage Notes

Binding to negative controls

Normalization among replicates and assays

Sequences without a pred_aff value

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

The rise of big data: deep sequencing-driven computational methods are transforming the landscape of synthetic antibody design

Machine learning optimization of candidate antibody yields highly diverse sub-nanomolar affinity antibody libraries

Search

Navigation