Abstract
DNA barcodes are frequently corrupted by insertion, deletion, and substitution errors introduced during DNA synthesis, amplification, and sequencing, resulting in index hopping. In this paper, we propose a new DNA barcode construction scheme that pairs each bit of a cyclic block codeword with the corresponding bit of a predetermined pseudo-random sequence and then maps each bit pair to a base, yielding the DNA barcode. We then present a barcode identification scheme for noisy sequencing reads that combines cyclic shifting with traditional dynamic programming to mark insertion and deletion positions and then performs erasure-and-error-correction decoding on the corrupted codewords. Furthermore, we verify the identification error rate of the barcodes under multiple errors and evaluate their reliability in DNA context. The method generalizes readily to the construction of long barcodes, which may be used in scenarios with severe errors. Simulation results show that, after insertions/deletions are identified, the bit error rate is greatly reduced when cyclic shifting is combined with dynamic programming compared with dynamic programming alone, indicating that the proposed method effectively improves the accuracy of insertion/deletion estimation. The overall identification error rate of the proposed method is lower than \(10^{-5}\) when the per-base mutation probability is less than 0.1, which is the typical scenario in third-generation sequencing.
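The construction step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the (7,4) cyclic Hamming code with generator polynomial \(g(x) = 1 + x + x^3\), the seeded pseudo-random generator, the mapping 00→A, 01→C, 10→G, 11→T, and the function names `cyclic_encode`, `pseudo_random_bits`, `build_barcode`, and `split_barcode`.

```python
import random

# Hypothetical bit-pair-to-base mapping (the abstract does not specify one).
PAIR_TO_BASE = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}

# Assumed generator polynomial g(x) = 1 + x + x^3 of the (7,4) cyclic
# Hamming code, coefficients listed from lowest to highest degree.
GEN_POLY = [1, 1, 0, 1]


def cyclic_encode(msg, gen=GEN_POLY):
    """Non-systematic cyclic encoding: c(x) = m(x) * g(x) over GF(2)."""
    code = [0] * (len(msg) + len(gen) - 1)
    for i, m_bit in enumerate(msg):
        if m_bit:
            for j, g_bit in enumerate(gen):
                code[i + j] ^= g_bit
    return code


def pseudo_random_bits(n, seed=2021):
    """Predetermined pseudo-random bit sequence; the seed is an arbitrary choice."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n)]


def build_barcode(msg, seed=2021):
    """Pair each codeword bit with a pseudo-random bit and map each pair to a base."""
    code = cyclic_encode(msg)
    prs = pseudo_random_bits(len(code), seed)
    return "".join(PAIR_TO_BASE[(c, p)] for c, p in zip(code, prs))


def split_barcode(barcode):
    """Invert the mapping: recover the codeword bits and the pseudo-random bits."""
    pairs = [BASE_TO_PAIR[base] for base in barcode]
    return [c for c, _ in pairs], [p for _, p in pairs]


if __name__ == "__main__":
    barcode = build_barcode([1, 0, 1, 1])
    print(barcode, split_barcode(barcode))
```

In this sketch each base carries one codeword bit and one pseudo-random bit, so a 7-bit codeword yields a 7-base barcode. The identification stage (cyclic shifting, dynamic programming, and erasure-and-error-correction decoding) is not sketched here, since the abstract does not give enough detail to reproduce it.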
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13205-020-02607-5/MediaObjects/13205_2020_2607_Fig9_HTML.png)
Acknowledgements
We thank the National Natural Science Foundation of China (61671324) and the Seed Foundation of Tianjin University (2019XZY-0038, 2019XYF-0005).
Author information
Contributions
W.C. designed the study. W.C., P.W., L.W., D.Z., and M.H. performed bioinformatic analyses. P.W. and L.W. performed the simulations and wrote the manuscript. L.W. and M.H. validated the results. W.C., D.Z., M.H., and L.S. supervised the results and revised the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
About this article
Cite this article
Chen, W., Wang, P., Wang, L. et al. Low-complexity and highly robust barcodes for error-rich single molecular sequencing. 3 Biotech 11, 78 (2021). https://doi.org/10.1007/s13205-020-02607-5
DOI: https://doi.org/10.1007/s13205-020-02607-5