Multiple Sequence Alignment System for Pyrosequencing Reads

  • Fahad Saeed
  • Ashfaq Khokhar
  • Osvaldo Zagordi
  • Niko Beerenwinkel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5462)

Abstract

Pyrosequencing is among the emerging sequencing techniques, capable of generating up to 100,000 overlapping reads in a single run. It is much faster and cheaper than existing state-of-the-art sequencing techniques such as Sanger sequencing. However, the reads generated by pyrosequencing are short and contain numerous errors. The reads must be aligned before they can be used in any subsequent analysis. Existing multiple sequence alignment methods cannot be used because they do not take into account the specific positions of the sequences with respect to the genome, and they are highly inefficient for a large number of sequences. Therefore, the common practice has been to use either simple pairwise alignment, despite its poor accuracy for error-prone pyroreads, or computationally expensive techniques based on sequential gap propagation. In this paper, we develop a computationally efficient method based on domain decomposition, referred to as pyro-align, to align such a large number of reads. The proposed alignment algorithm accurately aligns the erroneous reads in a short period of time, and is orders of magnitude faster than any existing method. The accuracy of the alignment is confirmed by the consensus obtained from the multiple alignment.
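The last step mentioned in the abstract, deriving a consensus from a multiple alignment, can be illustrated with a minimal sketch. This is not the pyro-align implementation: it is a plain column-wise majority vote, and the function name `consensus`, the `.` padding symbol for positions a read does not cover, and the `-` gap symbol are all assumptions made for illustration.

```python
from collections import Counter

PAD = "."  # assumed convention: position not covered by this read
GAP = "-"  # assumed convention: alignment gap inside a read


def consensus(aligned_reads):
    """Return a majority-vote consensus of equal-length aligned reads.

    Each column is decided only by the reads that cover it. If the gap
    symbol wins the vote, the column contributes nothing to the consensus.
    Ties are broken by first-seen order, which is arbitrary.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    result = []
    for column in zip(*aligned_reads):
        votes = Counter(symbol for symbol in column if symbol != PAD)
        if not votes:
            continue  # no read covers this column
        winner = votes.most_common(1)[0][0]
        if winner != GAP:
            result.append(winner)
    return "".join(result)


if __name__ == "__main__":
    reads = [
        "ACGT-AC....",
        "..GTTACGGT.",
        "....-ACGGTA",
    ]
    print(consensus(reads))  # ACGTACGGTA
```

In the example, the fifth column holds two gaps and one `T`, so the lone `T` is treated as an insertion error and dropped. Comparing such a consensus against a known reference sequence is one simple way to check alignment quality.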

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Fahad Saeed (1)
  • Ashfaq Khokhar (1)
  • Osvaldo Zagordi (2)
  • Niko Beerenwinkel (2)

  1. Department of Electrical and Computer Engineering, University of Illinois at Chicago, USA
  2. Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
