Scalable design of orthogonal DNA barcode libraries

Gowri, Gokul; Sheng, Kuanwei; Yin, Peng

doi:10.1038/s43588-024-00646-z

Brief Communication
Open access
Published: 07 June 2024

Volume 4, pages 423–428, (2024)

Abstract

Orthogonal DNA barcode library design is an essential task in bioengineering. Here we present seqwalk, an efficient method for designing barcode libraries that satisfy a sequence symmetry minimization (SSM) heuristic for orthogonality, with theoretical guarantees of maximal or near-maximal library size under certain design constraints. Seqwalk encodes SSM constraints in a de Bruijn graph representation of sequence space, enabling the application of recent advances in discrete mathematics¹ to the problem of orthogonal sequence design. We demonstrate the scalability of seqwalk by designing a library of >10⁶ SSM-satisfying barcode sequences in less than 20 s on a standard laptop.

Main

Orthogonal DNA barcode libraries are widely used in modern biotechnology. For example, orthogonal sequences are used to barcode protein targets in DNA-based bioimaging², to label RNA molecules in individual cells for single-cell studies³ and to program the assembly of components in a synthesis process⁴, among many other applications^5,6,7,8,9. The number of addressable features (cells, protein targets and so on) in these methods is dependent on the size of the orthogonal DNA sequence library that is used. As such, the problem of designing large orthogonal DNA sequence libraries appears across many areas of study in bioengineering.

Depending on the specific application, there are different approaches to designing orthogonal sequence libraries. One powerful approach is using physical models to design sequences such that off-target interactions are thermodynamically unfavorable¹⁰. Currently, scaling thermodynamic design tools to massive libraries (for example, exceeding 10⁵ nucleotides) or exhaustive searches of large sequence space (for example, all 4²⁵ possible 25-nt sequences) is prohibitively computationally expensive¹¹.

One widely used alternative is sequence symmetry minimization (SSM)^{4,7,12,13,14,15,16,17}. A set of sequences is considered to satisfy SSM for length k if no subsequence of length k appears more than one time in the set. In technologies with sequencing-based barcode readouts, satisfying SSM decreases the likelihood of incorrectly assigning barcodes¹⁷. In technologies with hybridization-based barcode readouts, satisfying SSM decreases the probability of off-target binding^14,15. It is important to note that, while an informative heuristic, SSM does not explicitly capture thermodynamic properties of sequences, and cannot guarantee low off-target binding energies (Supplementary Notes 6 and 7).
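The SSM definition above translates directly into a membership check. The following is a minimal sketch, assuming sequences are plain Python strings; the function name `satisfies_ssm` is illustrative and is not taken from the seqwalk package.

```python
def satisfies_ssm(library, k):
    """Return True if no length-k subsequence appears more than once in the set."""
    seen = set()
    for seq in library:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in seen:  # repeated within one sequence or across sequences
                return False
            seen.add(kmer)
    return True


print(satisfies_ssm(["ACGT", "TTGA"], 3))  # True: ACG, CGT, TTG, TGA are all distinct
print(satisfies_ssm(["ACGT", "CGTA"], 3))  # False: CGT appears in both sequences
```

Because every k-mer is hashed once, the check runs in time proportional to the total number of k-mers in the library.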

Sequence symmetry has the appealing property that it can be mathematically represented using de Bruijn graphs^18,19,20. In this Brief Communication, we show that this graph representation of sequence space enables a massively scalable approach to DNA barcode design. We build on recent advances in discrete mathematics^1,21 to develop seqwalk, an efficient tool for designing SSM-satisfying barcode libraries, and provide theoretical bounds on orthogonal library size under various design constraints. We provide accessible software implementations of seqwalk and show that it is capable of designing >10⁶ 25-nt barcode sequences in less than 20 s on a single standard central processing unit (CPU) core, with provable guarantees of maximal library size under an SSM constraint.

k-mer graphs for orthogonal sequence design

The key observation underlying seqwalk is that orthogonality constraints in sequence design problems can be naturally encoded in de Bruijn graph representations of sequence space^18,19,20. De Bruijn graphs, also known as k-mer graphs, are sequence representations that have been well studied in discrete mathematics^1,21,22. A k-mer is a length k sequence. A k-mer graph has all possible k-mers as nodes, and edges between k-mers that overlap by k − 1 symbols. In particular, if a k-mer k₁ can be transformed into a k-mer k₂ by removing its first symbol and appending a symbol, then there is a directed edge from k₁ to k₂.
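To make the edge rule concrete, the sketch below builds a small k-mer graph explicitly. It is only meant for illustration with small k (the graph has m^k nodes), and the helper names `successors` and `kmer_graph` are assumptions of this example rather than seqwalk functions.

```python
from itertools import product


def successors(kmer, alphabet="ACGT"):
    """k-mers reachable by removing the first symbol and appending one symbol."""
    return [kmer[1:] + s for s in alphabet]


def kmer_graph(k, alphabet="ACGT"):
    """Adjacency dict of the k-mer graph: every possible k-mer is a node."""
    nodes = ("".join(p) for p in product(alphabet, repeat=k))
    return {node: successors(node, alphabet) for node in nodes}


graph = kmer_graph(3)
print(len(graph))    # 64 nodes for k = 3 over a four-letter alphabet
print(graph["ACG"])  # ['CGA', 'CGC', 'CGG', 'CGT']
```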

On a k-mer graph, a length L sequence can be represented as a path over L − k + 1 nodes. The traversed nodes will correspond to each k-mer that appears in the sequence. A set of sequences that can be represented as nonintersecting paths on a k-mer graph share no common k-mers and, thus, satisfy SSM for the corresponding k. This points toward a method for generating sequences that implicitly satisfy SSM for length k: one can simply select several nonintersecting paths on a k-mer graph. One way to produce nonintersecting paths on a graph is to take a single self-avoiding walk and then partition this walk into multiple nonintersecting paths. The longest possible self-avoiding walks on a graph are Hamiltonian paths, which visit every node of the graph exactly one time. A partitioned Hamiltonian path will result in sequences that fully occupy k-mer space and, thus, yield maximally sized orthogonal sequence libraries.
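The partitioning idea can be sketched in a few lines. This example is not the seqwalk implementation: as a stand-in for the shift rule of ref. 1, it uses the classic recursive Lyndon-word construction of a de Bruijn sequence, which also visits every k-mer exactly once but stores the whole sequence in memory. The function names are hypothetical.

```python
def de_bruijn_indices(m, k):
    """Cyclic de Bruijn sequence of order k over symbols 0..m-1 (Lyndon-word construction)."""
    a = [0] * (k + 1)
    out = []

    def extend(t, p):
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            extend(t + 1, p)
            for j in range(a[t - p] + 1, m):
                a[t] = j
                extend(t + 1, t)

    extend(1, 1)
    return out


def partition_path(k, L, alphabet="ACGT"):
    """Cut a Hamiltonian path of the k-mer graph into length-L sequences sharing no k-mer."""
    m = len(alphabet)
    cyclic = de_bruijn_indices(m, k)   # length m**k; each k-mer occurs once cyclically
    linear = cyclic + cyclic[:k - 1]   # unroll so that all m**k k-mers occur linearly
    step = L - k + 1                   # number of nodes (k-mers) used per sequence
    n_seqs = m**k // step
    return ["".join(alphabet[i] for i in linear[j * step:j * step + L])
            for j in range(n_seqs)]


library = partition_path(k=3, L=8)
print(len(library))  # 10 sequences, i.e. floor(4**3 / 6)
```

Consecutive sequences overlap by k − 1 symbols in the underlying path, so they share symbols but never a complete k-mer.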

In seqwalk, we apply a recently discovered mathematical technique for traversing de Bruijn graphs, which yields Hamiltonian paths in amortized O(1) time and memory per node¹, to efficiently and scalably design orthogonal sequence libraries (Fig. 1). While finding Hamiltonian paths in arbitrary graphs is computationally hard²³, the mathematical structure of de Bruijn graphs enables efficient identification of Hamiltonian paths. The core algorithm is simple: our implementation requires less than 100 lines of code, including output formatting.

Fig. 1: Workflow of graph-based sequence design algorithm.

Performance benchmarks

To understand the practical relevance of the efficiency of seqwalk, we compare its time and memory efficiency with DeLOB⁷, an existing approach for large orthogonal library design. In brief, DeLOB begins with a large candidate library of random sequences, uses BLAST²⁴ to identify pairs of sequences that violate SSM with k = 12, then chooses a subset of sequences that do not violate SSM. While DeLOB candidate libraries can be refined on the basis of sequence design constraints (such as melting temperature or lack of intramolecular secondary structure), for the purpose of benchmarking, we reimplemented DeLOB with unconstrained candidate libraries of random sequences (Methods).

We run DeLOB with various numbers of candidate sequences and compare this with seqwalk run with an SSM constraint of k = 12. We find that seqwalk produces about two orders of magnitude more sequences than DeLOB run for a similar time (Fig. 2). In our benchmarking setup, DeLOB has a peak memory usage of nearly 100 GB to design about 3.6 × 10⁴ sequences, making it incompatible with current personal computing hardware. In comparison, seqwalk has peak memory usage of less than 1 GB and produces over 10⁶ sequences. In summary, seqwalk is capable of efficiently producing SSM-satisfying sequence libraries, requiring only standard personal computing hardware to design libraries exceeding 10⁶ sequences in less than 20 s.

Fig. 2: Time and memory efficiency of seqwalk in comparison with a reimplementation of the DeLOB method, described in ref. ⁷, for the problem of designing 25-nt barcodes satisfying SSM k = 12, with no additional sequence constraints.

Seqwalk’s exhaustive traversal of sequence space is also useful for designing small orthogonal libraries with minimal sequence symmetry. We demonstrate this by comparing a seqwalk library with a widely used multiplexing barcode library for a single-cell RNA sequencing method, MULTI-seq²⁵. The original MULTI-seq library consists of nine distinct 8-nt barcodes, designed to have pairwise Hamming distances ≥3. This design strategy yields barcodes with high sequence symmetry, resulting in barcode ambiguity that may give rise to experimental artifacts¹⁷. The seqwalk equivalent of this library, designed with the smallest k to yield at least nine sequences (k = 3), has minimal barcode ambiguity, lower homopolymer prevalence, improved pairwise Hamming distances and similar guanine-cytosine (GC) diversity (Supplementary Note 5 and Supplementary Table 1).

Sequence design under additional constraints

In many applications, there are additional constraints to orthogonal sequence libraries beyond crosstalk between barcodes. One common constraint is the prevention of crosstalk with reverse complements of sequences in the library. For sequence design under this constraint, seqwalk integrates two approaches: a filtering approach, and an adaptation of the Hierholzer algorithm²⁶ for four-letter libraries with odd k (Methods).

Seqwalk design can also consider other common constraints such as requiring GC content or melting temperature within a window, the absence of specific sequence patterns and the absence of substantial secondary structure. We provide efficient algorithms for filtering seqwalk libraries for these characteristics (Supplementary Notes 1–4). We find that three-letter seqwalk libraries are particularly amenable to such filtering, as they have sequences with lower variance in GC content and melting temperature (Supplementary Figs. 1 and 2), low prevalence of secondary structure (Supplementary Fig. 3) and less crosstalk with reverse complements (Methods).

Theoretical results

SSM-satisfying sequence libraries designed by the partitioning of a Hamiltonian path (such as in seqwalk) are maximally sized. This can be trivially proven by contradiction, by noting that every possible k-mer in the sequence space appears in the library. If there existed a larger library of SSM-satisfying sequences, it would use a larger number of k-mers and, thus, would repeat k-mers and not satisfy SSM (Methods).

Fundamental results about de Bruijn graphs²² almost directly yield a closed form expression for the number of sequences in seqwalk libraries under different design parameters. For alphabet size m, sequence length L and SSM constraint k, the number of possible orthogonal sequences N is the number of nodes in the k-mer graph divided by the number of nodes required to represent a sequence of length L. More precisely,

$$N=\left\lfloor \frac{m^{k}}{L-k+1}\right\rfloor. \tag{1}$$
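As a worked instance of equation (1), take the benchmark setting used in the text, namely a four-letter alphabet, 25-nt barcodes and k = 12:

$$N=\left\lfloor \frac{4^{12}}{25-12+1}\right\rfloor =\left\lfloor \frac{16{,}777{,}216}{14}\right\rfloor =1{,}198{,}372,$$

which is consistent with the library of more than 10⁶ sequences reported for this setting.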

This theoretical result has practical relevance. For a practitioner who wishes to design a certain number of sequences, the strongest possible SSM constraint (that is, the smallest possible k) can be determined using the relationship between k and N. Given a desired library size of N_d, sequence length L and alphabet size m, we can choose the smallest k such that N ≥ N_d. Designing a library using the resulting k value yields a maximally orthogonal (as defined by SSM) library with the desired number of sequences. This function is implemented in the seqwalk software library and named max_orthogonality (Extended Data Fig. 1).
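The selection of k described above can be sketched as a short search over equation (1). This is an illustration only: the names `library_size` and `smallest_k` are hypothetical, and the sketch does not reproduce the signature or behavior of the package's max_orthogonality function.

```python
def library_size(m, L, k):
    """Equation (1): nodes in the k-mer graph divided by nodes per length-L sequence."""
    return m**k // (L - k + 1)


def smallest_k(n_desired, L, m=4):
    """Smallest k <= L for which equation (1) yields at least n_desired sequences."""
    for k in range(1, L + 1):
        if library_size(m, L, k) >= n_desired:
            return k
    return None  # the request cannot be met for this L and m


print(smallest_k(200, 25))  # 6: floor(4**6 / 20) = 204, whereas k = 5 gives only 48
print(smallest_k(9, 8))     # 3: floor(4**3 / 6) = 10
```

A linear scan suffices because N grows monotonically with k: the numerator m^k increases while the denominator L − k + 1 decreases.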

While the proof of maximal library size does not hold under additional design constraints (such as reverse complement prevention, GC content filtering and so on), we can estimate or exactly state useful lower bounds on the size of seqwalk libraries after downstream filtering. For example, we can place lower bounds on the number of sequences present after a filtering for a specific sequence pattern of length p ≤ k. The number of k-mers containing a specific pattern of length p is

$$K_{p}\le (k-p+1)\times m^{k-p}, \tag{2}$$

where m is the size of the alphabet. Since no k-mer appears in more than one sequence in the library, we must remove at most K_p sequences from our library to remove all sequences containing a pattern of length p. As such, the size of the filtered library, N_p, is

$$N_{p}\ge N-K_{p}. \tag{3}$$

Such lower bounds are simple to determine for practically relevant pattern constraints, such as the prevention of homopolymeric regions (Supplementary Note 4). For certain choices of k, L and p, the size of pattern-free seqwalk libraries is near identical to the maximum possible library size under no pattern constraint. For example, for patterns with length p = k, at most one sequence is removed per pattern.
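The bound in equation (2) can be checked by brute force for small k. The sketch below is illustrative, with hypothetical function names; the brute-force count enumerates all m^k k-mers and is therefore only practical for small k.

```python
from itertools import product


def pattern_kmer_bound(k, p, m):
    """Right-hand side of equation (2): (k - p + 1) * m**(k - p)."""
    return (k - p + 1) * m ** (k - p)


def count_kmers_with_pattern(k, pattern, alphabet="ACGT"):
    """Exact number of k-mers that contain `pattern`, by exhaustive enumeration."""
    return sum(pattern in "".join(t) for t in product(alphabet, repeat=k))


def filter_pattern(library, pattern):
    """Drop every sequence that contains `pattern`."""
    return [seq for seq in library if pattern not in seq]


print(pattern_kmer_bound(5, 3, 4))         # 48
print(count_kmers_with_pattern(5, "AAA"))  # 40, below the bound of 48
```

The bound is loose here because a k-mer such as AAAAA contains the pattern at several offsets and is counted once per offset in equation (2).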

Additionally, we derive a lower bound on the size of seqwalk libraries upon filtering for orthogonality with reverse complements (Methods). For the case of three-letter libraries with odd k, we show that the size of a seqwalk library that satisfies orthogonality with reverse complements, N_rc, can be bounded by

$$N_{\mathrm{rc}}\ge N-2^{k-1}. \tag{4}$$

The size of seqwalk libraries under GC content constraints is not as easily determined analytically. However, empirical results show that seqwalk libraries have consistent distributions of GC content, resembling the binomial distribution expected of uniformly random sequences (Supplementary Note 1). As such, these distributions can be used to estimate the size of seqwalk libraries under GC content constraints.
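One way to turn this observation into an estimate is sketched below, under the assumption that the GC count of each sequence follows a binomial distribution. The per-base GC probability `p_gc` is an assumption of this example (0.5 for a uniform four-letter alphabet, 1/3 for an A, C, T alphabet), and the function names are hypothetical.

```python
from math import comb


def gc_window_fraction(L, gc_min, gc_max, p_gc=0.5):
    """Binomial probability that a length-L sequence has between gc_min and gc_max G/C bases."""
    return sum(comb(L, g) * p_gc**g * (1 - p_gc)**(L - g)
               for g in range(gc_min, gc_max + 1))


def estimated_filtered_size(N, L, gc_min, gc_max, p_gc=0.5):
    """Estimated number of sequences left after keeping only those inside the GC window."""
    return N * gc_window_fraction(L, gc_min, gc_max, p_gc)


# Illustrative window: 10 to 15 G/C bases in a 25-nt barcode (40-60% GC).
print(round(estimated_filtered_size(1_198_372, 25, 10, 15)))
```

The result is an estimate rather than a guarantee, since the binomial model is only an empirical approximation of the GC distribution of seqwalk libraries.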

Implementation as a software tool

We have implemented the seqwalk algorithm and additional filtering tools in a ‘pip’ distributed Python package (seqwalk, documented at seqwalk.readthedocs.io). Additionally, we have developed an interactive, code-free, web-based seqwalk interface in a publicly accessible Google Colaboratory notebook (link on seqwalk.readthedocs.io), based on a Julia implementation. While the Julia implementation is faster, the Python implementation and package allow for easier incorporation with the existing ecosystem of tools for sequence design and analysis. We envision the use of seqwalk as a part of a sequence design pipeline, with downstream filtering (experimental validation, genomic homology filtering and so on) as necessary for specific application contexts. Due to the simplicity of the underlying algorithms, we expect that others can implement our design method in other settings and modify it as necessary for different design pipelines.

Discussion

In this paper, we introduced seqwalk, a method for scalably designing DNA barcode libraries that satisfy SSM constraints. Seqwalk enables the design of SSM-satisfying libraries consisting of millions of sequences, using only standard personal computing hardware.

While seqwalk can be applied to many design problems, its use of the SSM heuristic makes it more directly applicable in certain experimental contexts. In particular, seqwalk is well suited for problems where nuanced biophysical properties (for example, exact ΔG) do not need to be tightly controlled (Supplementary Notes 6 and 7). In settings where biophysical or other experimental design constraints are strong, seqwalk can be used upstream of other design tools as a way to quickly constrain design space on the basis of an SSM heuristic. We expect that seqwalk can be valuable, either alone or in conjunction with other sequence design tools, for the rapidly growing class of high-throughput biological methods that use synthetic DNA sequences as barcodes for different biomolecular features (for example, samples, cells, protein targets and plasmids).

Additionally, the theoretical guarantees on the size of seqwalk libraries can be used to guide design choices in experimental method development. Using the results presented in this paper, one can quickly assess tradeoffs between design parameters and orthogonal sequence library size.

At least two threads of future investigation are raised by this work. First, the graph representation used in seqwalk only captures orthogonality as defined by SSM. Is it possible to generalize the approach to other notions of orthogonality, such as those defined by physical models? Second, as SSM remains an appealing orthogonality heuristic for its tractability, can we precisely identify experimental settings where it is insufficient?

Graph representations of sequences are commonly used to describe naturally occurring biological sequences²⁷. There is growing interest in sequence representations amenable to design tasks, in addition to descriptive tasks^28,29. With seqwalk, we demonstrate that graph-based sequence representations enable massive efficiency improvements in SSM-satisfying sequence library design.

Methods

Clarifying notions of orthogonality

Here, we will try to be more precise about what we mean by ‘orthogonality’ and ‘crosstalk’. We will separate the discussion for two broad application categories of DNA barcodes: sequencing-based and hybridization-based.

For sequencing-based barcodes, we consider two barcodes (A and B) to have crosstalk if they cannot be easily disambiguated on the basis of the sequencing readout of the barcode; in other words, if barcode A and barcode B can be distorted into the same sequence through the process of library preparation, sequencing and alignment.

For hybridization-based barcodes, we consider two sequences (A and B) to have crosstalk if they can stably hybridize with each other’s reverse complements. In other words, if a complex between A and B* or A* and B is likely to form with experimentally relevant propensity, we consider A and B to have crosstalk.

If we think of A and B as probes, with A* and B* being their respective targets, we consider crosstalk to be the binding of a probe to an incorrect target. We do not by default consider binding between A and B to be crosstalk.

For many, but not all, applications, this is a sufficient characterization of crosstalk. In the case of multiplexed DNA exchange imaging, a single probe (referred to as imager in the multiplexed imaging literature), rather than a pool of probes, can be present in a sample at a given time². As such, one need not consider binding between probes. Analogously, in DNA similarity search, a single ‘query’ probe is used to bind ‘target’ strands, so preventing binding between probe strands is not necessary³⁰.

In some applications, where orthogonal sequence libraries and their reverse complements are mixed together in a single reaction, such as in multiplexed PCR³¹, a stronger definition of crosstalk is required. We call this orthogonality including reverse complements, where A and B have crosstalk if any pair of A, A*, B, B* have substantial binding (other than the desired A with A*, and B with B*).

For all cases above, SSM is an applicable heuristic for orthogonality. While other heuristics are stronger for certain applications (Supplementary Notes 6 and 7), in this paper, we consider only SSM, as it enables scalable sequence design via mathematical abstraction.

Proof of maximal library size under SSM constraints

Definitions

Sequence library: set of sequences of length L over alphabet of size m
k-mer: subsequence of length k
SSM satisfied for length k: no subsequence of length k appears more than once, for k ≤ L
Maximally sized SSM sequence library: a sequence library satisfying SSM for length k with size such that no larger sequence library satisfying SSM for length k exists.

Lemma 1

A maximally sized sequence library that satisfies SSM for length k contains at most m^k distinct k-mers.

Proof of Lemma 1

Assume for the sake of contradiction that there exists an SSM satisfying library for length k, which has K > m^k k-mers. Since there are only m^k possible k-mers, by the pigeonhole principle, at least one k-mer must appear >1 time in the library. Since a k-mer appears more than once in the library, it does not satisfy SSM. We have arrived at a contradiction.

Theorem 1

A sequence library generated by the partitioning of a Hamiltonian path in an order-k de Bruijn graph (the k-mer graph) is a maximally sized SSM sequence library for length k.

Proof of Theorem 1

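By definition, the number of k-mers in such a library is equal to the number of nodes in the corresponding de Bruijn graph. The number of nodes in the de Bruijn graph, by definition, is m^k. By Lemma 1, a maximally sized sequence library that satisfies SSM for length k contains at most m^k k-mers. Since every sequence of length L contributes L − k + 1 distinct k-mers, no SSM-satisfying library can contain more than ⌊m^k/(L − k + 1)⌋ sequences, which is the number obtained by partitioning the Hamiltonian path. Thus, no larger SSM-satisfying library exists.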

Orthogonality with reverse complements

In the seqwalk package, we implement three different strategies for orthogonal sequence design that considers reverse complementarity.

For the case of three-letter alphabets and odd SSM k values, we describe an efficient algorithm for filtering out reverse complementary sequences. Without loss of generality, consider the case of sequences constructed with an A, C, T alphabet.

We want a library with no repeated k-mers, and no k-mers whose reverse complement also appears in the library. k-mers containing C cannot have their reverse complements also appear in the library, since the library will not contain G. So, we only need to consider k-mers composed entirely of A and T.

To use k-mers whose reverse complement will not appear in the library, we partition all AT k-mers into two sets, such that the reverse complement of each sequence in one set appears in the other set. Then, we can remove all sequences containing k-mers from one partition. Thus, the reverse complements of any k-mers that appear in the library will not be present.

For odd k, we can easily find such a partitioning by noting that the middle base of an AT-only k-mer always differs from the middle base of its reverse complement. For example, in a 5-mer, the third base can never be the same as the third base of its reverse complement. So, we can simply divide the k-mers into two sets according to the identity of their middle base.
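The middle-base argument can be verified exhaustively for a small odd k. The sketch below is illustrative and uses hypothetical helper names; it enumerates all 2^k AT-only k-mers, so it is intended for small k only.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def split_at_kmers(k):
    """Partition all A/T-only k-mers (odd k) by the identity of the middle base."""
    if k % 2 == 0:
        raise ValueError("the middle-base argument requires odd k")
    mid = k // 2
    kmers = ["".join(t) for t in product("AT", repeat=k)]
    with_a = {x for x in kmers if x[mid] == "A"}
    with_t = {x for x in kmers if x[mid] == "T"}
    return with_a, with_t


with_a, with_t = split_at_kmers(5)
# Reverse complementation maps one half exactly onto the other half.
print({reverse_complement(x) for x in with_a} == with_t)  # True
print(len(with_a))                                        # 16, i.e. 2**(5 - 1)
```

Removing every sequence that contains a k-mer from one of the two halves therefore leaves no k-mer whose reverse complement is still present.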

Using this approach, we can easily lower-bound the size of the resulting library upon filtering. We know that there are 2^k k-mers consisting entirely of A and T. Half of these k-mers will have A as the middle base. At most, we will remove one sequence from the library for each k-mer with A as the middle base. As such, we can lower-bound the number of sequences upon reverse complementarity filtering, N_rc, using

$$N_{\mathrm{rc}}\ge N-2^{k-1}.$$

This theoretical result indicates that seqwalk still produces relatively large sequence libraries upon such filtering. For example, for the case of 25-nt barcodes with a three-letter code, SSM k = 13 and removal of reverse complements, we will have a sequence library with $N_{\mathrm{rc}}\ge \left\lfloor \frac{3^{13}}{13}\right\rfloor -2^{12}=118{,}544=1.18544\times 10^{5}$ sequences.

In the case of a four-letter alphabet, filtering is more difficult because we cannot constrain reverse complementary k-mers to AT sequences. For odd-k and four-letter codes, we use a modification of the Hierholzer algorithm, in which we mark both the visited k-mer and its reverse complement ‘visited’ during traversal. This method requires keeping track of visited nodes and, as such, is less time/memory efficient than the shift rule traversal. Our implementation can be found in the adapted_hierholzer function in the generation module of the seqwalk source code.

For even-k, we implement a simple hashing-based approach to filter out reverse complements. We iterate through each sequence in an SSM-satisfying (without considering reverse complements) library, and if it has a k-mer that matches the reverse complement of a k-mer in a previous sequence in the library, we remove the sequence from the library.
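A hashing-based filter of this kind can be sketched with a Python set. This is a minimal illustration and not the seqwalk source: the function name is hypothetical, "previous sequences" is interpreted as the sequences kept so far, and reverse-complementary k-mers within a single sequence are not checked.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def filter_reverse_complements(library, k):
    """Keep a sequence only if none of its k-mers is the reverse complement
    of a k-mer from a previously kept sequence."""
    kept = []
    seen_rc = set()  # reverse complements of all k-mers in kept sequences
    for seq in library:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        if any(kmer in seen_rc for kmer in kmers):
            continue  # clashes with an earlier sequence, so drop it
        kept.append(seq)
        seen_rc.update(reverse_complement(kmer) for kmer in kmers)
    return kept


print(filter_reverse_complements(["AAAAC", "GTTTT", "ACACA"], k=4))
# ['AAAAC', 'ACACA']: GTTTT is the reverse complement of AAAAC
```

Each k-mer is looked up and inserted once, so the filter runs in time proportional to the total number of k-mers in the input library.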

Benchmarking DeLOB performance

DeLOB⁷ and seqwalk design libraries using similar, but not identical, design constraints. DeLOB uses the presence of BLAST high-scoring segment pairs (HSPs) of length 13 or more as a heuristic for crosstalk. Based on the BLAST parameters used in DeLOB, an HSP must contain at least 12 bases of exact match. This means that SSM with k = 12 is at least as strong an orthogonality criterion as that used in DeLOB, because a library with no repeated 12-mer cannot contain such an HSP. As such, we compare DeLOB to seqwalk libraries designed with the k = 12 constraint. DeLOB, as presented in ref. ⁷, also filters sequences for melting temperature, secondary structure and absence of restriction sites. To more directly compare DeLOB and seqwalk, we reimplemented DeLOB with no additional sequence filtering beyond orthogonality and used seqwalk with no additional sequence filtering.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Source data for Fig. 2 and Extended Data Fig. 1 are provided with this paper. The numerical data supporting the findings of this paper are provided in the source data files, and the sequences themselves can be generated by running our software.

Code availability

The code to reproduce all analyses from this paper can be found on GitHub at github.com/ggdna/seqwalk_paper_reproducibility. The software library can be found on Zenodo at https://doi.org/10.5281/zenodo.10932482 (ref. ³²) and installed from https://pypi.org/project/seqwalk/.

References

1. Sawada, J., Williams, A. & Wong, D. A simple shift rule for k-ary de Bruijn sequences. Discrete Math. 340, 524–531 (2017).
2. Saka, S. K. et al. Immuno-SABER enables highly multiplexed and amplified protein imaging in tissues. Nat. Biotechnol. 37, 1080–1090 (2019).
3. Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
4. Gartner, Z. J. & Liu, D. R. The generality of DNA-templated synthesis as a basis for evolving non-natural small molecules. J. Am. Chem. Soc. 123, 6961–6963 (2001).
5. Casini, A. et al. R2oDNA designer: computational design of biologically neutral synthetic DNA sequences. ACS Synth. Biol. 3, 525–528 (2014).
6. Yu, T. C. et al. Multiplexed characterization of rationally designed promoter architectures deconstructs combinatorial logic for IPTG-inducible systems. Nat. Commun. 12, 325 (2021).
7. Xu, Q., Schlabach, M. R., Hannon, G. J. & Elledge, S. J. Design of 240,000 orthogonal 25mer DNA barcode probes. Proc. Natl. Acad. Sci. USA 106, 2289–2294 (2009).
8. Marathe, A., Condon, A. E. & Corn, R. M. On combinatorial DNA word design. J. Comput. Biol. 8, 201–219 (2001).
9. Kishi, J. Y., Schaus, T. E., Gopalkrishnan, N., Xuan, F. & Yin, P. Programmable autonomous synthesis of single-stranded DNA. Nat. Chem. 10, 155–164 (2018).
10. Evans, C. G. & Winfree, E. in DNA Computing and Molecular Programming (eds Soloveichik, D. & Yurke, B.) 61–75 (Springer, 2013).
11. Fornace, M. E., Porubsky, N. J. & Pierce, N. A. A unified dynamic programming framework for the analysis of interacting nucleic acid strands: enhanced models, scalability, and speed. ACS Synth. Biol. 9, 2665–2678 (2020).
12. Seeman, N. C. De novo design of sequences for nucleic acid structural engineering. J. Biomol. Struct. Dyn. 8, 573–581 (1990).
13. Shoemaker, D. D., Lashkari, D. A., Morris, D., Mittmann, M. & Davis, R. W. Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nat. Genet. 14, 450–456 (1996).
14. He, Z., Wu, L., Li, X., Fields, M. W. & Zhou, J. Empirical establishment of oligonucleotide probe design criteria. Appl. Environ. Microbiol. 71, 3753–3760 (2005).
15. Kane, M. D. et al. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 28, 4552–4557 (2000).
16. Beliveau, B. J. et al. OligoMiner provides a rapid, flexible environment for the design of genome-scale oligonucleotide in situ hybridization probes. Proc. Natl. Acad. Sci. USA 115, E2183–E2192 (2018).
17. Booeshaghi, A. S., Min, K. H. J., Gehring, J. & Pachter, L. Quantifying orthogonal barcodes for sequence census assays. Bioinform. Adv. 4, vbad181 (2024).
18. Smith, W. D. & Schweitzer, A. in DIMACS Series in Discrete Mathematics and Theoretical Computer Science (eds Lipton, R. J. & Baum, E.) 121–185 (American Mathematical Society, 1996).
19. Kozyra, J. et al. Designing uniquely addressable bio-orthogonal synthetic scaffolds for DNA and RNA origami. ACS Synth. Biol. 6, 1140–1149 (2017).
20. Kozak, A., Głowacki, T. & Formanowicz, P. A method for constructing artificial DNA libraries based on generalized de Bruijn sequences. Discrete Appl. Math. 259, 127–144 (2019).
21. Sawada, J., Williams, A. & Wong, D. A surprisingly simple de Bruijn sequence construction. Discrete Math. 339, 127–131 (2016).
22. van Aardenne-Ehrenfest, T. & de Bruijn, N. G. in Classic Papers in Combinatorics (eds Gessel, I. & Rota, G. C.) 149–163 (Springer, 2009).
23. Karp, R. M. in Complexity of Computer Computations: Proceedings of a Symposium on the Complexity of Computer Computations (eds Miller, R. E., Thatcher, J. W. & Bohlinger, J. D.) 85–103 (Springer, 1972).
24. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421 (2009).
25. McGinnis, C. S. et al. MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods 16, 619–626 (2019).
26. Hierholzer, C. & Wiener, C. Über die Möglichkeit, einen Linienzug ohne Wiederholung und ohne Unterbrechung zu umfahren. Math. Ann. 6, 30–32 (1873).
27. Idury, R. M. & Waterman, M. S. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291–306 (1995).
28. Linder, J. & Seelig, G. Fast activation maximization for molecular sequence design. BMC Bioinform. 22, 510 (2021).
29. Weinstein, E. N. et al. Optimal design of stochastic DNA synthesis protocols based on generative sequence models. In Proc. 25th International Conference on Artificial Intelligence and Statistics, Vol. 151 of Proc. Machine Learning Research (eds Camps-Valls, G., Ruiz, F. J. R. & Valera, I.) 7450–7482 (PMLR, 2022).
30. Bee, C. et al. Molecular-level similarity search brings computing to DNA data storage. Nat. Commun. 12, 4764 (2021).
31. Xie, N. G. et al. Designing highly multiplex PCR primer sets with simulated annealing design using dimer likelihood estimation (SADDLE). Nat. Commun. 13, 1881 (2022).
32. Gowri, G. ggdna/seqwalk: v0.3.1 (v0.3.1). Zenodo https://doi.org/10.5281/zenodo.10932482 (2024).

Acknowledgements

We thank J. Kishi, T. Brailovskaya and E. Winfree for thoughtful discussions. We thank X. Lun, N. Liu and peer reviewers for feedback on the manuscript. We thank the Jupyter Project for maintaining open-source computational tools. This work is supported by the National Institutes of Health (grants DP1GM133052, RF1MH128861, R01GM124401 and R01HG012926) and Wyss Institute’s Molecular Robotics Initiative.

Author information

Authors and Affiliations

Department of Systems Biology, Harvard Medical School, Boston, MA, USA
Gokul Gowri, Kuanwei Sheng & Peng Yin
Wyss Institute for Biologically Inspired Engineering at Harvard University, Boston, MA, USA
Gokul Gowri, Kuanwei Sheng & Peng Yin

Contributions

G.G. conceived the study, designed the algorithms, developed the software and wrote the manuscript. K.S. conceived the study, provided conceptual guidance and wrote the manuscript. P.Y. conceived and supervised the study, provided technical and conceptual guidance, and wrote the manuscript.

Corresponding authors

Correspondence to Gokul Gowri or Peng Yin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Masami Hagiya and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Extended data

Extended Data Fig. 1 Depiction of seqwalk software package.

(a) Example design code that produces a library of at least 200 25-nt sequences with maximal orthogonality according to the SSM heuristic. (b) Output of example design code. (c) Crosstalk analysis of designed library using Hamming distance. Each row/column represents a sequence, and each entry is colored by Hamming distance.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Table 1 and Notes 1–7.

Reporting Summary

Source data

Source Data Fig. 2

Numerical performance data for sequence design software.

Source Data Extended Data Fig. 1

Hamming distance matrix of depicted sequence library.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Gowri, G., Sheng, K. & Yin, P. Scalable design of orthogonal DNA barcode libraries. Nat Comput Sci 4, 423–428 (2024). https://doi.org/10.1038/s43588-024-00646-z

Received: 11 July 2022
Accepted: 15 May 2024
Published: 07 June 2024
Issue Date: June 2024
DOI: https://doi.org/10.1038/s43588-024-00646-z
Springer Nature America, Inc.
