A Graph-Based Approach for Detecting Sequence Homology in Highly Diverged Repeat Protein Families

  • Jonathan N. Wells
  • Joseph A. Marsh
Part of the Methods in Molecular Biology book series (MIMB, volume 1851)


Reconstructing evolutionary relationships in repeat proteins is notoriously difficult due to the high degree of sequence divergence that typically occurs between duplicated repeats. This is complicated further by the fact that proteins with a large number of similar repeats are more likely to produce significant local sequence alignments than proteins with fewer copies of the repeat motif. Furthermore, biologically correct sequence alignments are sometimes impossible to achieve in cases where insertion or translocation events disrupt the order of repeats in one of the sequences being aligned. Combined, these attributes make traditional phylogenetic methods for studying protein families unreliable for repeat proteins, due to the dependence of such methods on accurate sequence alignment.

We present here a practical solution to this problem, making use of graph clustering combined with the open-source software package HH-suite, which enables highly sensitive detection of sequence relationships. Multiple rounds of homology searches, carried out via alignment of profile hidden Markov models, generate large sets of related proteins. The relationships between proteins in these sets are then represented as graphs, and subsequent clustering with the Markov cluster algorithm enables robust detection of repeat protein subfamilies.
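The graph-clustering step can be illustrated with a minimal sketch of the Markov cluster (MCL) loop: alternate expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation) until the matrix stops changing. This is a toy pure-Python version for intuition only, not the chapter's pipeline, which would presumably rely on the dedicated MCL software. The function name `mcl`, the integer protein indices, the edge weights, and the inflation value of 2.0 are all assumptions made for illustration; in practice the weights might be derived from profile-HMM hit scores.

```python
def mcl(edges, n, inflation=2.0, iterations=50, tol=1e-9):
    """Toy Markov clustering of an undirected weighted graph.

    edges: iterable of (i, j, weight) with 0 <= i, j < n (hypothetical protein indices).
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    # Symmetric adjacency matrix with self-loops on the diagonal.
    m = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        m[i][j] = w
        m[j][i] = w
    for i in range(n):
        m[i][i] = 1.0

    def normalise(mat):
        # Rescale every column to sum to 1 (column-stochastic matrix).
        for c in range(n):
            total = sum(mat[r][c] for r in range(n))
            for r in range(n):
                mat[r][c] /= total
        return mat

    m = normalise(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: raise entries to a power, then renormalise columns,
        # which strengthens strong flows and weakens weak ones.
        inflated = normalise([[v ** inflation for v in row] for row in expanded])
        delta = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if delta < tol:
            break

    # Each row that still carries mass is an attractor; the columns with
    # mass in that row form its cluster. Identical clusters are merged.
    clusters = set()
    for r in range(n):
        members = frozenset(c for c in range(n) if m[r][c] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical example: two tightly linked triples joined by one weak edge.
example_edges = [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
    (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
    (2, 3, 0.1),
]
print(mcl(example_edges, 6))
```

Higher inflation values generally yield finer-grained clusters, so this parameter is the natural one to vary when exploring candidate subfamily boundaries.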

Key words

Repeat proteins, Sequence homology, Graph clustering, Profile-HMM alignment, Protein families, Evolution



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
