Identification of Protein Homologs and Domain Boundaries by Iterative Sequence Alignment

  • Dustin Schaeffer
  • Nick V. Grishin
Part of the Methods in Molecular Biology book series (MIMB, volume 1851)


Evolutionary domains are protein regions with observable sequence similarity to other known domains. Here we describe how to use common sequence and profile alignment algorithms (i.e., BLAST, HHsearch) to delineate putative domains in novel protein sequences, given a reference library of protein domains. In this case, we use our database of evolutionary domains (ECOD) as a reference, but other domain sequence libraries could be used (e.g., SCOP, CATH). We describe our domain partition algorithm along with specific notes on how to avoid domain indexing errors when working with multiple data sources and software algorithms with differing outputs.
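As a rough illustration of what delineating domains from alignment hits can look like, the following is a minimal sketch and not the chapter's actual partition algorithm. It assumes hits have already been parsed from BLAST or HHsearch output into query ranges. The `Hit` record, the `partition_domains` function, and the `min_len` and `max_overlap` thresholds are all hypothetical names and values chosen for this example. Coordinates are assumed to be 1-based and inclusive; mixing that convention with 0-based or structure-derived numbering is one way indexing errors can arise.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One alignment hit against a reference domain (hypothetical record)."""
    domain_id: str   # identifier of the reference domain (illustrative)
    start: int       # query start, 1-based inclusive (assumed convention)
    end: int         # query end, 1-based inclusive (assumed convention)
    score: float     # higher is better, e.g. a bit score or probability


def partition_domains(query_len: int, hits: List[Hit],
                      min_len: int = 20, max_overlap: int = 5) -> List[Hit]:
    """Greedily accept hits, best score first, that do not collide with
    residues already assigned to a previously accepted domain."""
    assigned = set()   # query positions already claimed by a domain
    accepted = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # Reject coordinates that cannot belong to this query; this often
        # signals a mismatch between indexing conventions.
        if hit.start < 1 or hit.end > query_len or hit.end < hit.start:
            raise ValueError(f"hit {hit.domain_id} has out-of-range coordinates")
        span = set(range(hit.start, hit.end + 1))
        if len(span) < min_len:
            continue   # too short to treat as a domain in this sketch
        if len(span & assigned) > max_overlap:
            continue   # overlaps an accepted domain too much
        accepted.append(hit)
        assigned |= span
    # Report domains in sequence order.
    return sorted(accepted, key=lambda h: h.start)


example_hits = [
    Hit("dA", 1, 100, 95.0),
    Hit("dB", 90, 150, 80.0),
    Hit("dC", 101, 220, 70.0),
    Hit("dD", 5, 15, 99.0),
]
domains = partition_domains(220, example_hits)
```

This sketch makes a single pass over one pooled list of hits. An iterative procedure could instead rerun a more sensitive search on the residues left unassigned after each round.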

Key words

Protein domains, Homologs, Sequence alignment



This work was supported in part by the National Institutes of Health (GM094575 to NVG) and the Welch Foundation (I-1505 to NVG).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Dustin Schaeffer (1)
  • Nick V. Grishin (1, 2)
  1. Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, USA
  2. Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, USA
