Abstract
There is an extensive literature on the use of probabilistic models, such as hidden Markov models, for the analysis of biological sequences. These models have a clear theoretical basis, and many heuristics have been developed to reduce the time and memory requirements of the dynamic programming algorithms used for their inference. Nevertheless, mirroring the shift in natural language processing, bioinformatics is increasingly seeing higher-accuracy predictions made by recurrent neural networks (RNNs). This shift is exemplified by basecalling on the Oxford Nanopore Technologies sequencing platform, in which a continuous time series of current measurements is mapped to a string of nucleotides. Current basecallers apply connectionist temporal classification (CTC), a method originally developed for speech recognition, and focus on the task of decoding RNN output from a single read. We wish to extend this method to the more general task of consensus basecalling from multiple reads, and in doing so exploit the gains in both accelerated algorithms for sequence analysis and recurrent neural networks, areas that have advanced in parallel over the past decade. To this end, we develop a dynamic programming algorithm for consensus decoding from a pair of RNNs, and show that it can be readily optimized with the use of an alignment envelope. We express this decoding in the notation of finite state automata, and show that pair RNN decoding can be compactly represented using automata operations. We additionally introduce a set of Markov chain Monte Carlo moves for consensus basecalling of multiple reads.
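To make the consensus objective concrete, the following is a minimal sketch and not the dynamic programming algorithm developed in the paper. It scores a candidate label sequence against each read with the standard CTC forward recursion, then finds the consensus by brute-force enumeration of short sequences. The function names, the blank index of 0, and the choice to score a candidate by the plain product of the two per-read probabilities (with no prior over sequences) are all assumptions made for illustration.

```python
import itertools
import numpy as np

BLANK = 0  # assumed convention: column 0 of the RNN output is the CTC blank


def ctc_sequence_prob(probs, labels):
    """Total probability of `labels` under standard CTC.

    probs  : (T, K) array, one row of symbol probabilities per time step.
    labels : sequence of non-blank symbol indices.
    """
    T = probs.shape[0]
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]              # stay on the same state
            if s >= 1:
                total += alpha[t - 1, s - 1]     # advance by one state
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]

    # Valid paths end on the last label or on the trailing blank.
    result = alpha[T - 1, S - 1]
    if S > 1:
        result += alpha[T - 1, S - 2]
    return result


def brute_force_consensus(probs_a, probs_b, n_symbols, max_len):
    """Exhaustively search sequences up to `max_len` for the best joint score.

    Illustrative only: cost grows as n_symbols ** max_len.
    """
    best, best_p = (), -1.0
    for length in range(max_len + 1):
        for labels in itertools.product(range(1, n_symbols + 1), repeat=length):
            p = ctc_sequence_prob(probs_a, labels) * ctc_sequence_prob(probs_b, labels)
            if p > best_p:
                best, best_p = labels, p
    return best, best_p
```

Exhaustive enumeration is exponential in the sequence length, which is the cost that a dynamic programming formulation restricted to an alignment envelope is intended to avoid.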
Notes
1. This builds on the interpretation of Scrappie, and similar CTC-decoding basecallers, as “transducer” neural networks (Tim Massingham, Oxford Nanopore Technologies, pers. comm.).
Acknowledgments
The authors were supported by NIH/NCI grant CA220441 and by NIH/NHGRI training grant T32 HG000047. We thank the anonymous reviewers for their helpful comments.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Silvestre-Ryan, J., Holmes, I. (2018). Consensus Decoding of Recurrent Neural Network Basecallers. In: Jansson, J., Martín-Vide, C., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2018. Lecture Notes in Computer Science, vol 10849. Springer, Cham. https://doi.org/10.1007/978-3-319-91938-6_11
DOI: https://doi.org/10.1007/978-3-319-91938-6_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91937-9
Online ISBN: 978-3-319-91938-6
eBook Packages: Computer Science, Computer Science (R0)