naiveBayesCall: An Efficient Model-Based Base-Calling Algorithm for High-Throughput Sequencing

  • Wei-Chun Kao
  • Yun S. Song
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6044)


Immense amounts of raw instrument data (i.e., images of fluorescence) are currently being generated using ultra-high-throughput sequencing platforms. An important computational challenge associated with this rapid advancement is to develop efficient algorithms that can extract accurate sequence information from raw data. To address this challenge, we recently introduced a novel model-based base-calling algorithm that is fully parametric and has several advantages over previously proposed methods. Our original algorithm, called BayesCall, significantly reduced the error rate, particularly in the later cycles of a sequencing run, and also produced useful base-specific quality scores with high discrimination ability. However, BayesCall is too computationally expensive to be of broad practical use. In this paper, we build on our previous model-based approach to devise an efficient base-calling algorithm that is orders of magnitude faster than BayesCall while maintaining a comparably high level of accuracy. Our new algorithm, called naiveBayesCall, uses approximation and optimization methods to achieve scalability. We describe the performance of naiveBayesCall and demonstrate how improved base-calling accuracy may facilitate de novo assembly when the coverage is low to moderate.
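To make the notion of a base-specific quality score concrete, the following is a minimal sketch that assumes the conventional Phred scaling, Q = -10 log10(P(error)); the abstract does not say how naiveBayesCall scales its scores. The function names, the example posterior probabilities, and the `min_error` floor are hypothetical and chosen only for illustration. This is not the authors' implementation.

```python
import math


def phred_quality(error_prob: float) -> float:
    """Convert an error probability to a Phred-scaled quality score."""
    if not 0.0 < error_prob <= 1.0:
        raise ValueError("error_prob must be in (0, 1]")
    return -10.0 * math.log10(error_prob)


def call_base(posteriors: dict[str, float], min_error: float = 1e-10) -> tuple[str, float]:
    """Pick the most probable base and score it.

    `posteriors` maps each base to its (hypothetical) posterior probability.
    The error probability is taken as the total mass on the other bases,
    floored at `min_error` (an arbitrary assumption) to avoid log10(0).
    """
    base = max(posteriors, key=posteriors.get)
    error_prob = sum(p for b, p in posteriors.items() if b != base)
    return base, phred_quality(max(error_prob, min_error))


# Illustrative posteriors for one sequencing cycle (made-up values).
base, quality = call_base({"A": 0.999, "C": 0.0005, "G": 0.0003, "T": 0.0002})
print(base, round(quality, 2))
```

Under this scaling, a score of 30 corresponds to an estimated error probability of 1 in 1000, so higher scores mark calls the model is more confident in.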


Quality Score, Hybrid Algorithm, Viterbi Algorithm, Sequence Matrix, Comparable Error Rate
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Wei-Chun Kao (1)
  • Yun S. Song (1, 2)
  1. Computer Science Division, University of California, Berkeley, USA
  2. Department of Statistics, University of California, Berkeley, USA
