Genome Mapping Statistics and Bioinformatics

  • Josyf C. Mychaleckyj
Part of the Methods in Molecular Biology™ book series (MIMB, volume 404)


The unprecedented availability of genome sequences, coupled with user-friendly, web-enabled search and analysis tools, allows practitioners to locate interesting genome features or sequence tracts with relative ease. Although many public model organism and genome-mapping resources offer pre-mapped genome browsing, biologists still need to perform de novo mapping analyses. Correctly interpreting the results in genome annotation databases, or the results of one’s own analyses, requires at least a conceptual understanding of the statistics and mechanics of genome searches, the results expected from statistical considerations, and the algorithms used by different search tools. This chapter introduces the basic statistical results that underlie the mapping of nucleotide sequences to genomes and briefly surveys the common genome-mapping programs and algorithms, all of which are available via publicly hosted web sites. Selecting the appropriate sequence search and mapping tool will often demand tradeoffs between sensitivity and specificity that relate to the statistics of the search.
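As a rough illustration of the kind of statistical expectation referred to above, the sketch below estimates how many exact matches of a short query would be expected by chance alone in a genome. It assumes independent bases with A, C, G, and T equally likely, which real genomes violate, and it simply doubles the count when both strands are searched, which overcounts reverse-palindromic queries. The function name and the genome size of 3e9 bases are illustrative assumptions, not taken from the chapter.

```python
def expected_chance_matches(query_length: int, genome_length: float,
                            both_strands: bool = True) -> float:
    """Expected number of exact chance matches of a query in a random genome.

    Assumption: bases are independent, each of A, C, G, T with probability 0.25.
    """
    if query_length < 1 or genome_length < query_length:
        raise ValueError("need 1 <= query_length <= genome_length")
    positions = genome_length - query_length + 1  # possible start sites
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** query_length


if __name__ == "__main__":
    genome = 3e9  # illustrative, roughly human-sized genome (assumption)
    for k in (12, 16, 20, 25):
        print(k, expected_chance_matches(k, genome))
```

Under these simplifying assumptions, a 16-base query is expected to match by chance on the order of once in a genome of this size, whereas a 25-base query is expected to match by chance far less than once. This is one way to see why longer queries give more specific mapping results.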

Key Words

Bioinformatics; genome annotation; genome mapping; genomics; human genome; mammalian genome; sequence alignment; sequence analysis; sequence search



Copyright information

© Humana Press Inc., Totowa, NJ 2007

Authors and Affiliations

  • Josyf C. Mychaleckyj (1)
  1. Center for Public Health Genomics, University of Virginia, Charlottesville
