Sequence-Specific Sequence Comparison Using Pairwise Statistical Significance

Part of the Advances in Experimental Medicine and Biology book series (AEMB, volume 696)


The deluge of biological sequence data in the public domain has made sequence comparison one of the most fundamental computational problems in bioinformatics. Biologists routinely use pairwise alignment programs to identify similar or, more specifically, related sequences (those having a common ancestor). Almost everything in bioinformatics depends on the inter-relationship between sequence, structure, and function (all encapsulated in the term relatedness), which is far from being well understood. The potential relatedness of two sequences is better judged by the statistical significance of the alignment score than by the alignment score alone. This chapter summarizes recent advances in accurately estimating the statistical significance of pairwise local alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence specific. It also compares ranking database sequences by pairwise statistical significance against well-known database search programs such as BLAST, PSI-BLAST, and SSEARCH. As expected, sequence-comparison performance (evaluated in terms of retrieval accuracy) improves significantly as the comparison process is made more sequence specific. Shortcomings of currently used approaches and some potentially useful directions for future work are also presented.
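
As a minimal sketch of why significance can be more informative than a raw score, the following Python estimates a pairwise p-value by shuffling one sequence, rescoring each shuffle against the other sequence, and fitting an extreme value (Gumbel) distribution to the shuffled scores. Everything in it is an illustrative assumption and not the chapter's implementation: the names `sw_score` and `pairwise_p_value` are hypothetical, the match/mismatch scoring with a linear gap penalty is a simplification (protein comparison usually relies on a substitution matrix and affine gaps), the shuffle count is arbitrary, and the method-of-moments fit is a simple stand-in for the censored maximum-likelihood fitting mentioned in the acknowledgments.

```python
import math
import random


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    The scoring values are illustrative assumptions, not tuned parameters.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def pairwise_p_value(seq1, seq2, n_shuffles=200, seed=0):
    """Return (observed score, estimated p-value) for one sequence pair.

    The null distribution is built from shuffles of seq2 only, so it
    depends on this specific pair rather than on a whole database.
    """
    rng = random.Random(seed)
    observed = sw_score(seq1, seq2)

    letters = list(seq2)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps the composition, destroys the order
        null_scores.append(sw_score(seq1, "".join(letters)))

    mean = sum(null_scores) / n_shuffles
    var = sum((s - mean) ** 2 for s in null_scores) / (n_shuffles - 1)

    # Method-of-moments Gumbel fit: scale from the variance,
    # location from the mean and the Euler-Mascheroni constant.
    beta = math.sqrt(6.0 * var) / math.pi
    if beta == 0.0:
        # Degenerate null (all shuffled scores equal): no fit is possible.
        return observed, (1.0 if observed <= mean else 0.0)
    mu = mean - 0.5772156649015329 * beta

    # P(S >= observed) = 1 - exp(-exp(-(observed - mu) / beta))
    z = (observed - mu) / beta
    p = -math.expm1(-math.exp(min(-z, 700.0)))
    return observed, p


if __name__ == "__main__":
    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(pairwise_p_value(query, query))
    print(pairwise_p_value(query, "DPNWGCHYTMDPNWGCHYTMDPNWGCHYTM"))
```

Because the null scores come from shuffles of the second sequence itself, the estimate reflects that pair's lengths and composition, which is the sense in which the comparison becomes sequence specific. A small p-value suggests the observed score is unlikely to arise by chance for this pair; with only a few hundred shuffles and a moment-based fit, very small p-values are extrapolations and should be read as rough indications.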



The authors would like to thank Dr. Sean Eddy for making the C routines for censored maximum-likelihood fitting available online, Dr. William R. Pearson for making the benchmark protein comparison database available online, and Dr. Volker Brendel for helpful discussions and for providing links to the data. This work was supported in part by NSF grants CNS-0551639, IIS-0536994, HECURA CCF-0621443, SDCI OCI-0724599, and IIS-0905205; DOE FASTOS award DE-FG02-08ER25848; and DOE SCIDAC-2: Scientific Data Management Center for Enabling Technologies (CET) grant DE-FC02-07ER25808.



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, USA
