Profiles of Natural and Designed Protein-Like Sequences Effectively Bridge Protein Sequence Gaps: Implications in Distant Homology Detection

Data Mining Techniques for the Life Sciences

Abstract

Sequence-based approaches are fundamental in guiding experimental investigations toward structural and/or functional insights into uncharacterized protein families. Powerful profile-based sequence search methods rely on a sequence space continuum to identify non-trivial relationships through homology detection. The computational design of protein-like sequences that serve as “artificial linkers” is useful in identifying relationships between distant members of a structural fold. Such sequences act as intermediates and guide homology searches between distantly related proteins. Here, we describe an approach that represents natural intermediate sequences and designed protein-like sequences as hidden Markov model (HMM) profiles to improve the sensitivity of existing search methods. Searches made within the “Profile database” recognized the parent structural fold for 90% of the search queries at a query coverage greater than 60%. For 1040 protein families with no available structure, fold associations were made through searches in the database of natural and designed sequence profiles. Most of the associations were made with the Alpha-alpha superhelix, Transmembrane beta-barrels, TIM barrel, and Immunoglobulin-like beta-sandwich folds. For 11 domain families of unknown function (DUFs), we provide confident fold associations using the profiles of designed sequences and a consensus from other fold recognition methods. For two of these DUFs, we performed detailed functional annotation through comparisons with characterized templates from families of known function.
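The query coverage criterion mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Hit` record, the `best_fold` helper, the profile identifiers, and the E-value cutoff of 1e-3 are all hypothetical, introduced only for illustration. The only value taken from the abstract is the 60% query coverage threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One query-to-profile match (hypothetical record layout)."""
    profile_id: str   # illustrative identifier of the matched profile
    fold: str         # fold label attached to that profile
    query_start: int  # first aligned query residue, 1-based inclusive
    query_end: int    # last aligned query residue, 1-based inclusive
    evalue: float     # E-value reported by the search tool


def query_coverage(hit: Hit, query_length: int) -> float:
    """Fraction of the query spanned by the alignment."""
    return (hit.query_end - hit.query_start + 1) / query_length


def best_fold(hits: List[Hit], query_length: int,
              min_coverage: float = 0.6,
              max_evalue: float = 1e-3) -> Optional[str]:
    """Return the fold of the most significant hit that passes both filters.

    min_coverage follows the 60% figure in the abstract;
    max_evalue is an assumed cutoff, not taken from the source.
    """
    kept = [h for h in hits
            if query_coverage(h, query_length) > min_coverage
            and h.evalue <= max_evalue]
    if not kept:
        return None
    return min(kept, key=lambda h: h.evalue).fold


if __name__ == "__main__":
    hits = [
        Hit("profile_A", "TIM barrel", 1, 50, 1e-10),
        Hit("profile_B", "Immunoglobulin-like beta-sandwich", 10, 90, 1e-5),
    ]
    print(best_fold(hits, query_length=100))
```

In this toy input, the first hit has the better E-value but spans only half of the query, so it is discarded by the coverage filter and the second hit determines the fold label.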

Acknowledgments

This research is supported by the Mathematical Biology program and the FIST program, sponsored by the Department of Science and Technology, and also by the Department of Biotechnology (DBT), Government of India, in the form of the IISc-DBT partnership program. We also gratefully acknowledge support from the Bioinformatics and Computational Biology Centre, funded by DBT, and support from UGC, India (Centre for Advanced Studies) and the Ministry of Human Resource Development, India. NS is a J. C. Bose National Fellow. SS was supported as a post-doctoral fellow by the DBT-IISc partnership program and is currently affiliated with M.S. Ramaiah University of Applied Sciences, Bangalore.

This paper is dedicated to one of its authors, Prof. N. Srinivasan, who passed away on September 03, 2021.

Author information

Corresponding authors

Correspondence to Narayanaswamy Srinivasan or Sankaran Sandhya.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Kumar, G., Srinivasan, N., Sandhya, S. (2022). Profiles of Natural and Designed Protein-Like Sequences Effectively Bridge Protein Sequence Gaps: Implications in Distant Homology Detection. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 2449. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2095-3_5

  • DOI: https://doi.org/10.1007/978-1-0716-2095-3_5

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-2094-6

  • Online ISBN: 978-1-0716-2095-3

  • eBook Packages: Springer Protocols
