Consensus Decoding of Recurrent Neural Network Basecallers

Silvestre-Ryan, Jordi; Holmes, Ian

doi:10.1007/978-3-319-91938-6_11

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10849)

Included in the following conference series:

International Conference on Algorithms for Computational Biology

Abstract

There is an extensive literature using probabilistic models, such as hidden Markov models, for the analysis of biological sequences. These models have a clear theoretical basis, and many heuristics have been developed to reduce the time and memory requirements of the dynamic programming algorithms used for their inference. Nevertheless, mirroring the shift in natural language processing, bioinformatics is increasingly seeing higher-accuracy predictions made by recurrent neural networks (RNNs). This shift is exemplified by basecalling on the Oxford Nanopore Technologies sequencing platform, in which a continuous time series of current measurements is mapped to a string of nucleotides. Current basecallers have applied connectionist temporal classification (CTC), a method originally developed for speech recognition, and have focused on the task of decoding RNN output from a single read. We wish to extend this method to the more general task of consensus basecalling from multiple reads, and in doing so, exploit the gains in both accelerated algorithms for sequence analysis and recurrent neural networks, areas that have advanced in parallel over the past decade. To this end, we develop a dynamic programming algorithm for consensus decoding from a pair of RNNs, and show that it can be readily optimized with the use of an alignment envelope. We express this decoding in the notation of finite state automata, and show that pair RNN decoding can be compactly represented using automata operations. We additionally introduce a set of Markov chain Monte Carlo moves for consensus basecalling of multiple reads.
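To make the idea of consensus decoding under CTC concrete, the following is a minimal illustrative sketch and not the chapter's algorithm. It scores a candidate nucleotide string against the RNN output of each read with the standard CTC forward recursion, multiplies the two likelihoods, and picks the best candidate by brute-force enumeration. The chapter's pair dynamic programming, alignment envelope, and automata formulation are not reproduced here. The function names, the choice of index 0 for the blank symbol, and the exhaustive search are all assumptions made for illustration.

```python
import itertools

BLANK = 0  # assumed index of the CTC blank symbol in each output row


def ctc_sequence_prob(probs, labels):
    """Likelihood of `labels` under one read, via the CTC forward recursion.

    probs  : list of T rows; row t holds per-symbol probabilities at time t
             (index 0 is the blank, indices 1.. are nucleotides).
    labels : sequence of non-blank symbol indices.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [BLANK]
    for s in labels:
        ext += [s, BLANK]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skip over a blank only between two different labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the final label or the final blank
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def consensus_by_enumeration(probs_a, probs_b, alphabet_size, max_len):
    """Brute-force consensus: the string maximising the product of the
    two per-read CTC likelihoods. Exponential in max_len; toy sizes only."""
    best, best_score = (), -1.0
    for n in range(max_len + 1):
        for cand in itertools.product(range(1, alphabet_size), repeat=n):
            score = ctc_sequence_prob(probs_a, cand) * ctc_sequence_prob(probs_b, cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score
```

The two reads may have different numbers of time steps, since each is scored independently. A practical implementation would work in log space to avoid underflow and would replace the enumeration with a search that scales, which is the problem the dynamic programming and Markov chain Monte Carlo approaches described in the abstract address.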

Notes

1. This builds on the interpretation of Scrappie, and similar CTC-decoding basecallers, as “transducer” neural networks (Tim Massingham, Oxford Nanopore Technologies, pers. comm.).

Acknowledgments

The authors were supported by NIH/NCI grant CA220441 and by NIH/NHGRI training grant T32 HG000047. We thank the anonymous reviewers for their helpful comments.

Author information

Authors and Affiliations

Department of Bioengineering, University of California, Berkeley, USA
Jordi Silvestre-Ryan & Ian Holmes

Corresponding author

Correspondence to Ian Holmes.

Editor information

Editors and Affiliations

The Hong Kong Polytechnic University, Kowloon, Hong Kong
Jesper Jansson
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
University of Extremadura, Cáceres, Spain
Miguel A. Vega-Rodríguez

About this paper

Cite this paper

Silvestre-Ryan, J., Holmes, I. (2018). Consensus Decoding of Recurrent Neural Network Basecallers. In: Jansson, J., Martín-Vide, C., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2018. Lecture Notes in Computer Science, vol 10849. Springer, Cham. https://doi.org/10.1007/978-3-319-91938-6_11

DOI: https://doi.org/10.1007/978-3-319-91938-6_11
Published: 17 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91937-9
Online ISBN: 978-3-319-91938-6
eBook Packages: Computer Science, Computer Science (R0)
