Fuzzy Genome Sequence Assembly for Single and Environmental Genomes

  • Sara Nasser
  • Adrienne Breland
  • Frederick C. Harris Jr.
  • Monica Nicolescu
  • Gregory L. Vert
Part of the Studies in Fuzziness and Soft Computing book series (STUDFUZZ, volume 242)

Summary

Traditional methods obtain a microorganism’s DNA by culturing it individually. Recent advances in genomics have led to the procurement of DNA from more than one organism directly from their natural habitat. Indeed, natural microbial communities are often very complex, with tens to hundreds of species. Assembling these genomes is a crucial step irrespective of the method of obtaining the DNA. This chapter presents fuzzy methods for multiple genome sequence assembly of cultured genomes (single organism) and environmental genomes (multiple organisms).

An optimal alignment of DNA genome fragments is based on several factors, such as the quality of bases and the length of overlap. Base quality indicates whether a base call is reliable or likely an experimental error. We propose a sequence assembly solution based on fuzzy logic, which tolerates inexactness and errors in fragment matching and can be used for improved assembly.
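As a rough illustration of how overlap length and base quality could be combined under fuzzy logic, the following minimal sketch scores a candidate overlap between two fragments. It is not the chapter's method: the function names, the piecewise-linear membership shape, every threshold, and the use of the minimum as the fuzzy AND are all assumptions made for illustration.

```python
def ramp(x, low, high):
    """Piecewise-linear membership: 0 at or below low, 1 at or above high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def overlap_membership(overlap_len, mean_quality, mismatches):
    """Degree (0..1) to which a candidate overlap counts as a good match.

    All thresholds below are hypothetical, chosen only for illustration.
    """
    # Longer overlaps are more trustworthy (assumed range: 20 to 100 bases).
    mu_length = ramp(overlap_len, 20, 100)
    # Higher mean base quality is more trustworthy (assumed range: 10 to 40).
    mu_quality = ramp(mean_quality, 10, 40)
    # Tolerate some mismatches: membership falls to 0 at a 10% mismatch rate.
    mismatch_rate = mismatches / overlap_len if overlap_len else 1.0
    mu_match = 1.0 - ramp(mismatch_rate, 0.0, 0.1)
    # Fuzzy AND taken as the minimum of the three memberships.
    return min(mu_length, mu_quality, mu_match)


if __name__ == "__main__":
    print(overlap_membership(60, 25, 0))
```

Unlike a hard cutoff, a graded score of this kind lets a short but high-quality overlap be ranked against a long but noisy one instead of rejecting either outright.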

We propose fuzzy classification using modified fuzzy weighted averages to classify fragments belonging to different organisms within an environmental genome population. Our proposed approach uses DNA-based signatures such as GC content and nucleotide frequencies as features for the classification. This divide-and-conquer strategy also improves performance on larger datasets. We evaluate our method on artificially created environmental genomes to test various combinations of organisms and on an environmental genome.
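The signature-based features mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' classifier: it uses a plain weighted average rather than the modified fuzzy weighted average the chapter proposes, and the function names, the dinucleotide default (`k=2`), the linear closeness measure, and the equal weighting are hypothetical.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq, k=2):
    """Relative frequency of each overlapping k-mer in seq."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def weighted_membership(fragment, reference, gc_weight=0.5, k=2):
    """Degree (0..1) to which fragment resembles reference.

    Plain weighted average of two memberships; weights are assumptions.
    """
    # GC membership: 1 when GC contents are equal, falling linearly.
    mu_gc = 1.0 - abs(gc_content(fragment) - gc_content(reference))
    f = kmer_frequencies(fragment, k)
    r = kmer_frequencies(reference, k)
    # Half the L1 distance between the two frequency profiles, in [0, 1].
    distance = 0.5 * sum(abs(f.get(m, 0.0) - r.get(m, 0.0))
                         for m in set(f) | set(r))
    mu_kmer = 1.0 - distance
    return gc_weight * mu_gc + (1.0 - gc_weight) * mu_kmer


if __name__ == "__main__":
    print(weighted_membership("ATGCATGC", "ATGCATGC"))
```

Under this sketch, a fragment would be assigned to whichever reference group yields the highest membership, so that each group can then be assembled separately.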

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Sara Nasser (1)
  • Adrienne Breland (1)
  • Frederick C. Harris Jr. (1)
  • Monica Nicolescu (1)
  • Gregory L. Vert (1)
  1. Department of Computer Science & Engineering, University of Nevada Reno, Reno, USA
