Abstract
Orthogonal DNA barcode library design is an essential task in bioengineering. Here we present seqwalk, an efficient method for designing barcode libraries that satisfy a sequence symmetry minimization (SSM) heuristic for orthogonality, with theoretical guarantees of maximal or near-maximal library size under certain design constraints. Seqwalk encodes SSM constraints in a de Bruijn graph representation of sequence space, enabling the application of recent advances in discrete mathematics1 to the problem of orthogonal sequence design. We demonstrate the scalability of seqwalk by designing a library of >106 SSM-satisfying barcode sequences in less than 20 s on a standard laptop.
Similar content being viewed by others
Main
Orthogonal DNA barcode libraries are widely used in modern biotechnology. For example, orthogonal sequences are used to barcode protein targets in DNA-based bioimaging2, to label RNA molecules in individual cells for single-cell studies3 and to program the assembly of components in a synthesis process4, among many other applications5,6,7,8,9. The number of addressable features (cells, protein targets and so on) in these methods is dependent on the size of the orthogonal DNA sequence library that is used. As such, the problem of designing large orthogonal DNA sequence libraries appears across many areas of study in bioengineering.
Depending on the specific application, there are different approaches to designing orthogonal sequence libraries. One powerful approach is using physical models to design sequences such that off-target interactions are thermodynamically unfavorable10. Currently, scaling thermodynamic design tools to massive libraries (for example, exceeding 105 nucleotides) or exhaustive searches of large sequence space (for example, all 425 possible 25-nt sequences) is prohibitively computationally expensive11.
One widely used alternative is sequence symmetry minimization (SSM)4,7,12,13,14,15,16,17. A set of sequences is considered to satisfy SSM for length k if no subsequence of length k appears more than one time in the set. In technologies with sequencing-based barcode readouts, satisfying SSM decreases the likelihood of incorrectly assigning barcodes17. In technologies with hybridization-based barcode readouts, satisfying SSM decreases the probability of off-target binding14,15. It is important to note that, while an informative heuristic, SSM does not explicitly capture thermodynamic properties of sequences, and cannot guarantee low off-target binding energies (Supplementary Notes 6 and 7).
Sequence symmetry has the appealing property that it can be mathematically represented using de Bruijn graphs18,19,20. In this Brief Communication, we show that this graph representation of sequence space enables a massively scalable approach to DNA barcode design. We build on recent advances in discrete mathematics1,21 to develop seqwalk, an efficient tool for designing SSM-satisfying barcode libraries, and provide theoretical bounds on orthogonal library size under various design constraints. We provide accessible software implementations of seqwalk and show that it is capable of designing >106 25-nt barcode sequences in less than 20 s on a single standard central processing unit (CPU) core, with provable guarantees of maximal library size under an SSM constraint.
k-mer graphs for orthogonal sequence design
The key observation underlying seqwalk is that orthogonality constraints in sequence design problems can be naturally encoded in de Bruijn graph representations of sequence space18,19,20. De Bruijn graphs, also known as k-mer graphs, are sequence representations that have been well studied in discrete mathematics1,21,22. A k-mer is a length k sequence. A k-mer graph has all possible k-mers as nodes, and edges between k-mers that overlap by k − 1 symbols. In particular, if a k-mer k1 can be transformed into a k-mer k2 by removing its first symbol and appending a symbol, then there is a directed edge from k1 to k2.
On a k-mer graph, a length L sequence can be represented as a path over L − k + 1 nodes. The traversed nodes will correspond to each k-mer that appears in the sequence. A set of sequences that can be represented as nonintersecting paths on a k-mer graph share no common k-mers and, thus, satisfy SSM for the corresponding k. This points toward a method for generating sequences that implicitly satisfy SSM for length k: one can simply select several nonintersecting paths on a k-mer graph. One way to produce nonintersecting paths on a graph is to take a single self-avoiding walk and then partition this walk into multiple nonintersecting paths. The longest possible self-avoiding walks on a graph are Hamiltonian paths, which visit every node of the graph exactly one time. A partitioned Hamiltonian path will result in sequences that fully occupy k-mer space and, thus, yield maximally sized orthogonal sequence libraries.
In seqwalk, we apply a recently discovered mathematical technique for traversing de Bruijn graphs, which yields Hamiltonian paths in amortized O(1) time and memory per node1, to efficiently and scalably design orthogonal sequence libraries (Fig. 1). While finding Hamiltonian paths in arbitrary graphs is computationally hard23, the mathematical structure of de Bruijn graphs enables efficient identification of Hamiltonian paths. The core algorithm is simple: our implementation requires less than 100 lines of code, including output formatting.
(1) Select length of desired sequences L, length of prevented substrings (SSM constraint) k, and allowed alphabet a. (2) Represent sequence space using a k-mer graph. (3) Select a Hamiltonian path in k-mer graph. (4) Partition the selected Hamiltonian path into fragments with L − k + 1 nodes. (5) Map fragments to corresponding sequences.
Performance benchmarks
To understand the practical relevance of the efficiency of seqwalk, we compare its time and memory efficiency with DeLOB7, an existing approach for large orthogonal library design. In brief, DeLOB begins with a large candidate library of random sequences, uses BLAST24 to identify pairs of sequences that violate SSM with k = 12, then chooses a subset of sequences that do not violate SSM. While DeLOB candidate libraries can be refined on the basis of sequence design constraints (such as melting temperature or lack of intramolecular secondary structure), for the purpose of benchmarking, we reimplemented DeLOB with unconstrained candidate libraries of random sequences (Methods).
We run DeLOB with various numbers of candidate sequences and compare this with seqwalk run with an SSM constraint of k = 12. We find that seqwalk produces about two orders of magnitude more sequences than DeLOB run for a similar time (Fig. 2). In our benchmarking setup, DeLOB has a peak memory usage of nearly 100 GB to design about 3.6 × 104 sequences, making it incompatible with current personal computing hardware. In comparison, seqwalk has peak memory usage of less than 1 GB and produces over 106 sequences. In summary, seqwalk is capable of efficiently producing SSM-satisfying sequence libraries, requiring only standard personal computing hardware to design libraries exceeding 106 sequences in less than 30 s.
DeLOB was run with varying choices for number of candidate sequences from 103 to 105, resulting in a range of computational costs and library sizes. All computations were performed using a single CPU core.
Seqwalk’s exhaustive traversal of sequence space is also useful for designing small orthogonal libraries with minimal sequence symmetry. We demonstrate this by comparing a seqwalk library with a widely used multiplexing barcode library for a single-cell RNA sequencing method, MULTI-seq25. The original MULTI-seq library consists of nine distinct 8-nt barcodes, designed to have pairwise Hamming distances ≥3. This design strategy yields barcodes with high sequence symmetry, resulting in barcode ambiguity that may give rise to experimental artifacts17. The seqwalk equivalent of this library, designed with the smallest k to yield at least nine sequences (k = 3), has minimal barcode ambiguity, lower homopolymer prevalence, improved pairwise Hamming distances and similar guanine-cytosine (GC) diversity (Supplementary Note 5 and Supplementary Table 1).
Sequence design under additional constraints
In many applications, there are additional constraints to orthogonal sequence libraries beyond crosstalk between barcodes. One common constraint is the prevention of crosstalk with reverse complements of sequences in the library. For sequence design under this constraint, seqwalk integrates two approaches: a filtering approach, and an adaptation of the Hierholzer algorithm26 for four-letter libraries with odd k (Methods).
Seqwalk design can also consider other common constraints such as requiring GC content or melting temperature within a window, the absence of specific sequence patterns and the absence of substantial secondary structure. We provide efficient algorithms for filtering seqwalk libraries for these characteristics (Supplementary Notes 1–4). We find that three-letter seqwalk libraries are particularly amenable to such filtering, as they have sequences with lower variance in GC content and melting temperature (Supplementary Figs. 1 and 2), low prevalence of secondary structure (Supplementary Fig. 3) and less crosstalk with reverse complements (Methods).
Theoretical results
SSM-satisfying sequence libraries designed by the partitioning of a Hamiltonian path (such as in seqwalk) are maximally sized. This can be trivially proven by contradiction, by noting that every possible k-mer in the sequence space appears in the library. If there existed a larger library of SSM-satisfying sequences, it would use a larger number of k-mers and, thus, would repeat k-mers and not satisfy SSM (Methods).
Fundamental results about de Bruijn graphs22 almost directly yield a closed form expression for the number of sequences in seqwalk libraries under different design parameters. For alphabet size m, sequence length L and SSM constraint k, the number of possible orthogonal sequences N is the number of nodes in the k-mer graph divided by the number of nodes required to represent a sequence of length L. More precisely,
This theoretical result has practical relevance. For a practitioner who wishes to design a certain number of sequences, the strongest possible SSM constraint (that is, the smallest possible k) can be determined using the relationship between k and N. Given a desired library size of Nd, sequence length L and alphabet size m, we can choose the smallest k such that N ≥ Nd. Designing a library using the resulting k value yields a maximally orthogonal (as defined by SSM) library with the desired number of sequences. This function is implemented in the seqwalk software library and named max_orthogonality (Extended Data Fig. 1).
While the proof of maximal library size does not hold under additional design constraints (such as reverse complement prevention, GC content filtering and so on), we can estimate or exactly state useful lower bounds on the size of seqwalk libraries after downstream filtering. For example, we can place lower bounds on the number of sequences present after a filtering for a specific sequence pattern of length p ≤ k. The number of k-mers containing a specific pattern of length p is
where m is the size of the alphabet. Since no k-mer appears in more than one sequence in the library, we must remove at most Kp sequences from our library to remove all sequences containing a pattern of length p. As such, the size of the filtered library, Np, is
Such lower bounds are simple to determine for practically relevant pattern constraints, such as the prevention of homopolymeric regions (Supplementary Note 4). For certain choices of k, L and p, the size of pattern-free seqwalk libraries is near identical to the maximum possible library size under no pattern constraint. For example, for patterns with length p = k, at most one sequence is removed per pattern.
Additionally, we derive a lower bound on the size of seqwalk libraries upon filtering for orthogonality with reverse complements (Methods). For the case of three-letter libraries with odd k, we show that the size of a seqwalk library that satisfies orthogonality with reverse complements, Nrc, can be bounded by
The size of seqwalk libraries under GC content constraints is not as easily determined analytically. However, empirical results show that seqwalk libraries have consistent distributions of GC content, resembling the binomial distribution expected of uniformly random sequences (Supplementary Note 1). As such, these distributions can be used to estimate the size of seqwalk libraries under GC content constraints.
Implementation as a software tool
We have implemented the seqwalk algorithm and additional filtering tools in a ‘pip’ distributed Python package (seqwalk, documented at seqwalk.readthedocs.io). Additionally, we have developed an interactive, code-free, web-based seqwalk interface in a publicly accessible Google Colaboratory notebook (link on seqwalk.readthedocs.io), based on a Julia implementation. While the Julia implementation is faster, the Python implementation and package allow for easier incorporation with the existing ecosystem of tools for sequence design and analysis. We envision the use of seqwalk as a part of a sequence design pipeline, with downstream filtering (experimental validation, genomic homology filtering and so on) as necessary for specific application contexts. Due to the simplicity of the underlying algorithms, we expect that others can implement our design method in other settings and modify it as necessary for different design pipelines.
Discussion
In this paper, we introduced seqwalk, a method for scalably designing DNA barcode libraries that satisfy SSM constraints. Seqwalk enables the design of SSM-satisfying libraries consisting of millions of sequences, using only standard personal computing hardware.
While seqwalk can be applied to many design problems, its use of the SSM heuristic makes it more directly applicable in certain experimental contexts. In particular, seqwalk is well suited for problems where nuanced biophysical properties (that is, exact ΔG) do not need to be tightly controlled (Supplementary Notes 6 and 7). In settings where biophysical or other experimental design constraints are strong, seqwalk can be used upstream of other design tools as a way to quickly constrain design space on the basis of an SSM heuristic. We expect that seqwalk can be valuable, either alone or in conjunction with other sequence design tools, for the rapidly growing class of high-throughput biological methods that use synthetic DNA sequences as barcodes for different biomolecular features (that is, samples, cells, protein targets, plasmids and so on).
Additionally, the theoretical guarantees on the size of seqwalk libraries can be used to guide design choices in experimental method development. Using the results presented in this paper, one can quickly assess tradeoffs between design parameters and orthogonal sequence library size.
At least two threads of future investigation are raised by this work. First, the graph representation used in seqwalk only captures orthogonality as defined by SSM. Is it possible to generalize the approach to other notions of orthogonality, such as those defined by physical models? Second, as SSM remains an appealing orthogonality heuristic for its tractability, can we precisely identify experimental settings where it is insufficient?
Graph representations of sequences are commonly used to describe naturally occurring biological sequences27. There is growing interest in sequence representations amenable to design tasks, in addition to descriptive tasks28,29. With seqwalk, we demonstrate that graph-based sequence representations enable massive efficiency improvements in SSM-satisfying sequence library design.
Methods
Clarifying notions of orthogonality
Here, we will try to be more precise about what we mean by ‘orthogonality’ and ‘crosstalk’. We will separate the discussion for two broad application categories of DNA barcodes: sequencing-based and hybridization-based.
For sequencing-based barcodes, we consider two barcodes (A and B) to have crosstalk if they cannot be easily disambiguated on the basis of sequencing readout of the barcode. In other words, if barcode A and barcode B can be distorted into the same sequence through the process of library preparation, sequencing and alignment.
For hybridization-based barcodes, we consider two sequences (A and B) to have crosstalk if they can stably hybridize with each other’s reverse complements. In other words, if a complex between A and B* or A* and B is likely to form with experimentally relevant propensity, we consider A and B to have crosstalk.
If we think of A and B as probes, with A* and B* being their respective targets, we consider crosstalk to be the binding of a probe to an incorrect target. We do not by default consider binding between A and B to be crosstalk.
For many, but not all, applications, this is a sufficient characterization of crosstalk. In the case of multiplexed DNA exchange imaging, a single probe (referred to as imager in the multiplexed imaging literature), rather than a pool of probes, can be present in a sample at a given time2. As such, one need not consider binding between probes. Analogously, in DNA similarity search, a single ‘query’ probe is used to bind ‘target’ strands, so preventing binding between probe strands is not necessary30.
In some applications, where orthogonal sequence libraries and their reverse complements are mixed together in a single reaction, such as in multiplexed PCR31, a stronger definition of crosstalk is required. We call this orthogonality including reverse complements, where A and B have crosstalk if any pair of A, A*, B, B* have substantial binding (other than the desired A with A*, and B with B*).
For all cases above, SSM is an applicable heuristic for orthogonality. While other heuristics are stronger for certain applications (Supplementary Notes 6 and 7), in this paper, we consider only SSM, as it enables scalable sequence design via mathematical abstraction.
Proof of maximal library size under SSM constraints
Definitions
-
Sequence library: set of sequences of length L over alphabet of size m
-
k-mer: subsequence of length k
-
SSM satisfied for length k: no subsequence of length k appears more than once, for k ≤ L
-
Maximally sized SSM sequence library: a sequence library satisfying SSM for length k with size such that no larger sequence library satisfying SSM for length k exists.
Lemma 1
A maximally sized sequence library that satisfies SSM for length k contains at most mk distinct k-mers.
Proof of Lemma 1
Assume for the sake of contradiction that there exists an SSM satisfying library for length k, which has K > mk k-mers. Since there are only mk possible k-mers, by the pigeonhole principle, at least one k-mer must appear >1 time in the library. Since a k-mer appears more than once in the library, it does not satisfy SSM. We have arrived at a contradiction.
Theorem 1
A sequence library generated by the partitioning of a Hamiltonian path in a k de Bruijn graph is a maximally sized SSM sequence library for length k.
Proof of Theorem 1
By definition, the number of k-mers in such a library is equal to the number of nodes in the corresponding de Bruijn graph. The number of nodes in the de Bruijn graph, by definition, is mk. By Lemma 1, a maximally sized sequence library that satisfies SSM for length contains at most mk k-mers. Thus, no larger SSM satisfying library exists.
Orthogonality with reverse complements
In the seqwalk package, we implement three different strategies for orthogonal sequence design that considers reverse complementarity.
For the case of three-letter alphabets and odd SSM k values, we describe an efficient algorithm for filtering out reverse complementary sequences. Without loss of generality, consider the case of sequences constructed with an A, C, T library.
We want a library with no repeated k-mers, and no k-mers whose reverse complement also appears in the library. k-mers containing C cannot have their reverse complements also appear in the library, since the library will not contain G. So, we only need to consider k-mers composed entirely of A and T.
To use k-mers whose reverse complement will not appear in the library, we partition all AT k-mers into two sets, such that the reverse complement of each sequence in one set appears in the other set. Then, we can remove all sequences containing k-mers from one partition. Thus, the reverse complements of any k-mers that appear in the library will not be present.
For odd k, we can easily find a partitioning by noting that the middle base in the k-mer must be different from its reverse complement. For example, in a 5-mer, the third base can never be the same as the third base of its reverse complement. So, we can simply divide k-mers into two sets according to the identity of its middle base.
Using this approach, we can easily lower bound the size of a resulting library will be upon filtering. We know that there are 2k k-mers consisting entirely of A and T. Half of these k-mers will have A as the middle base. At most, we will remove one sequence from the library for each k-mer with A as the middle base. As such, we can lower-bound the number of sequences upon reverse complementarity filtering, Nrc using
This theoretical result indicates that SeqWalk still produces relatively large sequence libraries upon such filtering. For example, for the case of 25-nt barcodes with three-letter code, SSM k = 13, and removal of reverse complements, we will have a sequence library with at least \({N}_{{\mathrm{rc}}}\ge \frac{{3}^{13}}{13}-{2}^{12}=1.18544\times 1{0}^{5}\) sequences.
In the case of a four-letter alphabet, filtering is more difficult because we cannot constrain reverse complementary k-mers to AT sequences. For odd-k and four-letter codes, we use a modification of the Hierholzer algorithm, in which we mark both the visited k-mer and its reverse complement ‘visited’ during traversal. This method requires keeping track of visited nodes and, as such, is less time/memory efficient than the shift rule traversal. Our implementation can be found in the adapted_hierholzer function in the generation module of the seqwalk source code.
For even-k, we implement a simple hashing-based approach to filter out reverse complements. We iterate through each sequence in a SSM-satisfying (without considering reverse complements) library, and if it has a k-mer that matches the reverse complements of previous sequences in the library, we remove the sequence from the library.
Benchmarking DeLOB performance
DeLOB7 and seqwalk design libraries using similar, but not identical, design constraints. DeLOB uses the presence BLAST high-score segment pairings (HSP) of length 13 or more as heuristic for crosstalk. Based on the BLAST parameters used in DeLOB, an HSP must contain at least 11 bases of exact match. This means the SSM of k = 12 is at least as strong of an orthogonality criterion as used in DeLOB. As such, we compare DeLOB to seqwalk libraries designed with the k = 12 constraint. DeLOB, as presented in ref. 7, also filters sequences for melting temperature, secondary structure and absence of restriction sites. To more directly compare DeLOB and seqwalk, we reimplemented DeLOB with no additional sequence filtering beyond orthogonality and used seqwalk with no additional sequence filtering.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Source data for Fig. 2 and Extended Data Fig. 1 are made available with this manuscript. The numerical data supporting the findings of this paper are provided in the source data files, and the sequences themselves can be generated by running our software. Source data are provided with this paper.
Code availability
The code to reproduce all analysis from this paper can be found via GitHub at github.com/ggdna/seqwalk_paper_reproducibility. The software library can be found via Zenodo at https://doi.org/10.5281/zenodo.10932482 ref. 32 and installed from https://pypi.org/project/seqwalk/.
References
Sawada, J., Williams, A. & Wong, D. A simple shift rule for k-ary de bruijn sequences. Discrete Math. 340, 524–531 (2017).
Saka, S. K. et al. Immuno-SABER enables highly multiplexed and amplified protein imaging in tissues. Nat. Biotechnol. 37, 1080–1090 (2019).
Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
Gartner, Z. J. & Liu, D. R. The generality of DNA-templated synthesis as a basis for evolving non-natural small molecules. J. Am. Chem. Soc. 123, 6961–6963 (2001).
Casini, A. et al. R2oDNA designer: computational design of biologically neutral synthetic DNA sequences. ACS Synth. Biol. 3, 525–528 (2014).
Yu, T. C. et al. Multiplexed characterization of rationally designed promoter architectures deconstructs combinatorial logic for IPTG-inducible systems. Nat. Commun. 12, 325 (2021).
Xu, Q., Schlabach, M. R., Hannon, G. J. & Elledge, S. J. Design of 240,000 orthogonal 25mer DNA barcode probes. Proc. Natl. Acad. Sci. USA 106, 2289–2294 (2009).
Marathe, A., Condon, A. E. & Corn, R. M. On combinatorial DNA word design. J. Comput. Biol. 8, 201–219 (2001).
Kishi, J. Y., Schaus, T. E., Gopalkrishnan, N., Xuan, F. & Yin, P. Programmable autonomous synthesis of single-stranded DNA. Nat. Chem. 10, 155–164 (2018).
Evans, C. G. & Winfree, E. in DNA Computing and Molecular Programming (eds. Soloveichik, D. & Yurke, B.) 61–75 (Springer, 2013).
Fornace, M. E., Porubsky, N. J. & Pierce, N. A. A unified dynamic programming framework for the analysis of interacting nucleic acid strands: enhanced models, scalability, and speed. ACS Synth. Biol. 9, 2665–2678 (2020).
Seeman, N. C. De novo design of sequences for nucleic acid structural engineering. J. Biomol. Struct. Dyn. 8, 573–581 (1990).
Shoemaker, D. D., Lashkari, D. A., Morris, D., Mittmann, M. & Davis, R. W. Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nat. Genet. 14, 450–456 (1996).
He, Z., Wu, L., Li, X., Fields, M. W. & Zhou, J. Empirical establishment of oligonucleotide probe design criteria. Appl. Environ. Microbiol. 71, 3753–3760 (2005).
Kane, M. D. et al. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 28, 4552–4557 (2000).
Beliveau, B. J. et al. OligoMiner provides a rapid, flexible environment for the design of genome-scale oligonucleotide in situ hybridization probes. Proc. Natl. Acad. Sci. USA 115, E2183–E2192 (2018).
Booeshaghi, A. S., Min, KyungHoiJoseph, Gehring, J. & Pachter, L. Quantifying orthogonal barcodes for sequence census assays. Bioinform. Adv. 4, vbad181 (2024).
Smith, W. D. & Schweitzer, A. in DIMACS Series in Discrete Mathematics and Theoretical Computer Science (eds. Lipton, R. J. & Baum, E.) 121–185. (American Mathematical Society, 1996).
Kozyra, J. et al. Designing uniquely addressable bio-orthogonal synthetic scaffolds for DNA and RNA origami. ACS Synth. Biol. 6, 1140–1149 (2017).
Kozak, A., Głowacki, T. & Formanowicz, P. A method for constructing artificial DNA libraries based on generalized de bruijn sequences. Discrete Appl. Math. 259, 127–144 (2019).
Sawada, J., Williams, A. & Wong, D. A surprisingly simple de Bruijn sequence construction. Discrete Math. 339, 127–131 (2016).
van Aardenne-Ehrenfest, T. & de Bruijn, N. G. in Classic Papers in Combinatorics (eds. Gessel, I. & Rota, G. C.) 149–163 (Springer, 2009).
Karp, R. M. in Complexity of Computer Computations: Proceedings of a Symposium on the Complexity of Computer Computations (eds Miller, R. E., Thatcher, J. W. & Bohlinger, J. D.) 85–103 (Springer, 1972).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421 (2009).
McGinnis, C. S. et al. MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods 16, 619–626 (2019).
Hierholzer, C. & Wiener, C. Über die Möglichkeit, einen Linienzug ohne Wiederholung und ohne Unterbrechung zu umfahren. Math. Ann. 6, 30–32 (1873).
Idury, R. M. & Waterman, M. S. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291–306 (1995).
Linder, J. & Seelig, G. Fast activation maximization for molecular sequence design. BMC Bioinform. 22, 510 (2021).
Weinstein, E. N. et al. Optimal design of stochastic DNA synthesis protocols based on generative sequence models. In Proc. 25th International Conference on Artificial Intelligence and Statistics, Vol. 151 of Proc. Machine Learning Research, (eds Camps-Valls, G., Ruiz, F. J. R. & Valera, I.) 7450–7482 (PMLR, 2022).
Bee, C. et al. Molecular-level similarity search brings computing to DNA data storage. Nat. Commun. 12, 4764 (2021).
Xie, N. G. et al. Designing highly multiplex PCR primer sets with simulated annealing design using dimer likelihood estimation (SADDLE). Nat. Commun. 13, 1881 (2022).
Gowri, G. ggdna/seqwalk: v0.3.1 (v0.3.1). Zenodo https://doi.org/10.5281/zenodo.10932482 (2024).
Acknowledgements
We thank J. Kishi, T. Brailovskaya and E. Winfree for thoughtful discussions. We thank X. Lun, N. Liu and peer reviewers for feedback on the manuscript. We thank the Jupyter Project for maintaining open-source computational tools. This work is supported by the National Institute of Health (grants DP1GM133052, RF1MH128861, R01GM124401 and R01HG012926) and Wyss Institute’s Molecular Robotics Initiative.
Author information
Authors and Affiliations
Contributions
G.G. conceived the study, designed the algorithms, developed the software and wrote the manuscript. K.S. conceived the study, provided conceptual guidance and wrote the manuscript. P.Y. conceived and supervised the study, provided technical and conceptual guidance, and wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Masami Hagiya and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Extended data
Extended Data Fig. 1 Depiction of seqwalk software package.
(a) Example design code which produces a library of at least 200 25nt sequences with maximal orthogonality according to the SSM heuristic. (b) Output of example design code. (c) Crosstalk analysis of designed library using Hamming distance. Each row/column represents a sequence, and each entry is colored by Hamming distance.
Supplementary information
Supplementary Information
Supplementary Figs. 1–7, Table 1 and Notes 1–7.
Source data
Source Data Fig. 2
Numerical performance data for sequence design software.
Source Data Extended Data Fig. 1
Hamming distance matrix of depicted sequence library.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gowri, G., Sheng, K. & Yin, P. Scalable design of orthogonal DNA barcode libraries. Nat Comput Sci 4, 423–428 (2024). https://doi.org/10.1038/s43588-024-00646-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s43588-024-00646-z
- Springer Nature America, Inc.