Consensus Decoding of Recurrent Neural Network Basecallers

  • Conference paper
Algorithms for Computational Biology (AlCoB 2018)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10849)

Abstract

There is an extensive literature using probabilistic models, such as hidden Markov models, for the analysis of biological sequences. These models have a clear theoretical basis, and many heuristics have been developed to reduce the time and memory requirements of the dynamic programming algorithms used for their inference. Nevertheless, mirroring the shift in natural language processing, bioinformatics is increasingly seeing higher accuracy predictions made by recurrent neural networks (RNNs). This shift is exemplified by basecalling on the Oxford Nanopore Technologies sequencing platform, in which a continuous time series of current measurements is mapped to a string of nucleotides. Current basecallers have applied connectionist temporal classification (CTC), a method originally developed for speech recognition, and focused on the task of decoding RNN output from a single read. We wish to extend this method to the more general task of consensus basecalling from multiple reads, and in doing so, exploit the gains in both accelerated algorithms for sequence analysis and recurrent neural networks, areas that have advanced in parallel over the past decade. To this end, we develop a dynamic programming algorithm for consensus decoding from a pair of RNNs, and show that it can be readily optimized with the use of an alignment envelope. We express this decoding in the notation of finite state automata, and show that pair RNN decoding can be compactly represented using automata operations. We additionally introduce a set of Markov chain Monte Carlo moves for consensus basecalling of multiple reads.
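To make the idea of consensus decoding concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes the standard CTC formulation (a per-time-step probability distribution over a blank symbol plus the nucleotide alphabet) and scores a candidate sequence against each read with the textbook CTC forward recursion. The consensus is then found by brute-force enumeration of short candidates, which stands in for the paper's pair dynamic programming and alignment envelope purely for illustration. The names `ctc_prob` and `brute_force_consensus`, the blank index, and the toy matrices are all hypothetical.

```python
from itertools import product

BLANK = 0  # index of the CTC blank symbol (an assumed convention)


def ctc_prob(probs, labels, blank=BLANK):
    """Probability that a CTC output matrix emits `labels`.

    probs  : list of T rows (T >= 1), each a list of per-symbol probabilities
    labels : tuple of non-blank symbol indices
    Uses the standard CTC forward recursion over the label sequence
    interleaved with blanks.
    """
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s, sym in enumerate(ext):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            if s >= 2 and sym != blank and sym != ext[s - 2]:
                total += alpha[s - 2]             # skip a blank between distinct labels
            new[s] = total * probs[t][sym]
        alpha = new
    # Valid paths end in the final blank or the final label.
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def brute_force_consensus(reads, alphabet, max_len):
    """Return the sequence (length <= max_len) maximizing the product of
    per-read CTC probabilities. Exponential in max_len: illustration only."""
    best, best_score = (), -1.0
    for n in range(max_len + 1):
        for cand in product(range(1, len(alphabet)), repeat=n):
            score = 1.0
            for probs in reads:
                score *= ctc_prob(probs, cand)
            if score > best_score:
                best, best_score = cand, score
    return "".join(alphabet[i] for i in best), best_score


# Toy example: columns are (blank, A, C); each read has two time steps.
alphabet = "-AC"
read1 = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1]]   # clearly favours "A"
read2 = [[0.2, 0.4, 0.4], [0.6, 0.2, 0.2]]   # tied between "A" and "C"

consensus, score = brute_force_consensus([read1, read2], alphabet, max_len=2)
print(consensus)
```

In this toy case the second read alone cannot distinguish "A" from "C", and combining it with the first read resolves the tie. A practical implementation would work in log space to avoid underflow and would replace the enumeration with a search restricted to plausible alignments.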

Notes

  1. This builds on the interpretation of Scrappie, and similar CTC-decoding basecallers, as “transducer” neural networks (Tim Massingham, Oxford Nanopore Technologies, pers. comm.).

Acknowledgments

The authors were supported by NIH/NCI grant CA220441 and by NIH/NHGRI training grant T32 HG000047. We thank the anonymous reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to Ian Holmes.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Silvestre-Ryan, J., Holmes, I. (2018). Consensus Decoding of Recurrent Neural Network Basecallers. In: Jansson, J., Martín-Vide, C., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2018. Lecture Notes in Computer Science, vol 10849. Springer, Cham. https://doi.org/10.1007/978-3-319-91938-6_11

  • DOI: https://doi.org/10.1007/978-3-319-91938-6_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91937-9

  • Online ISBN: 978-3-319-91938-6

  • eBook Packages: Computer Science, Computer Science (R0)
