Transcription Factors Across the Escherichia coli Pangenome: A 3D Perspective

  • Conference paper
Comparative Genomics (RECOMB-CG 2024)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 14616)

Abstract

Identification of complete sets of transcription factors (TFs) is a foundational step in the inference of genetic regulatory networks. The availability of high-quality predictions of protein three-dimensional (3D) structures has made it possible to use structural similarities for the inference of homology beyond what is possible from sequence analyses alone. This work explores the potential to use such 3D structures for the identification of TFs in the Escherichia coli pangenome. Comparisons between predicted structures and their experimental counterparts confirmed the high quality of predicted structures, with most 3D structural alignments showing TM-scores well above established structural similarity thresholds, though the quality seemed slightly lower for TFs than for other proteins. As expected, structural similarity decreased with sequence similarity, though most TM-scores still remained well above the structural similarity threshold. This was true regardless of whether the aligned structures were experimental or predicted. Results at the lowest sequence identity levels revealed potential for 3D structural comparisons to extend homology inferences below the “twilight zone” of sequence-based methods. The body of predicted 3D structures covered 99.7% of available proteins from the E. coli pangenome, missing only two of those matching TF domain sequence profiles. After cleaning out potential false positives, structural comparisons increased the number of inferred TFs in the pangenome by 50% over those obtained with sequence profiles alone, which translated into around 20% more TFs per genome.
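
As a rough illustration of the TM-score comparisons mentioned above, the following minimal Python sketch evaluates the standard TM-score formula for one fixed superposition and compares it with a similarity threshold. It is not the pipeline used in this work: the function names are hypothetical, the distances are assumed to come from an existing structural alignment, and the 0.5 cutoff is the commonly cited value rather than one stated in this abstract. Structural aligners additionally search for the superposition that maximizes the score, which this sketch does not attempt.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (angstroms) from the TM-score formula."""
    # The cube-root formula is not meaningful for very short chains;
    # flooring d0 at 0.5 is an assumption made here for safety.
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, target_length: int) -> float:
    """TM-score for ONE fixed superposition.

    distances: distances (angstroms) between aligned residue pairs.
    target_length: length of the reference protein used for normalization.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


def passes_threshold(score: float, threshold: float = 0.5) -> bool:
    """True if the score reaches the (assumed) structural similarity cutoff."""
    return score >= threshold


if __name__ == "__main__":
    # Hypothetical example: 80 aligned pairs, all 2 angstroms apart,
    # normalized by a 100-residue reference protein.
    score = tm_score([2.0] * 80, 100)
    print(round(score, 3), passes_threshold(score))
```

Normalizing by the reference length rather than by the number of aligned pairs means that a short, well-superposed fragment cannot score highly against a long protein.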

Supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Corresponding author

Correspondence to Gabriel Moreno-Hagelsieb.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Moreno-Hagelsieb, G. (2024). Transcription Factors Across the Escherichia coli Pangenome: A 3D Perspective. In: Scornavacca, C., Hernández-Rosales, M. (eds) Comparative Genomics. RECOMB-CG 2024. Lecture Notes in Computer Science, vol 14616. Springer, Cham. https://doi.org/10.1007/978-3-031-58072-7_11
