Transcription Factors Across the Escherichia coli Pangenome: A 3D Perspective

Moreno-Hagelsieb, Gabriel

doi:10.1007/978-3-031-58072-7_11

Gabriel Moreno-Hagelsieb ORCID: orcid.org/0000-0002-2457-4450

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 14616)

Included in the following conference series:

RECOMB International Workshop on Comparative Genomics

Abstract

Identification of complete sets of transcription factors (TFs) is a foundational step in the inference of genetic regulatory networks. The availability of high-quality predictions of protein three-dimensional (3D) structures has made it possible to use structural similarities to infer homology beyond what is possible from sequence analyses alone. This work explores the potential of such 3D structures for the identification of TFs in the Escherichia coli pangenome. Comparisons between predicted structures and their experimental counterparts confirmed the high quality of the predicted structures, with most 3D structural alignments showing TM-scores well above established structural similarity thresholds, though the quality seemed slightly lower for TFs than for other proteins. As expected, structural similarity decreased with sequence similarity, though most TM-scores remained well above the structural similarity threshold. This was true regardless of whether the aligned structures were experimental or predicted. Results at the lowest sequence identity levels revealed the potential for 3D structural comparisons to extend homology inferences below the “twilight zone” of sequence-based methods. The body of predicted 3D structures covered 99.7% of available proteins from the E. coli pangenome, missing only two of those matching TF domain sequence profiles. After cleaning out potential false positives, structural comparisons increased the number of inferred TFs in the pangenome by 50% over those obtained with sequence profiles alone, which translated into around 20% more TFs per genome.
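
To make the TM-score comparisons mentioned above more concrete, the following is a minimal Python sketch of the standard TM-score formula, evaluated on a set of residue-pair distances from an alignment that has already been superposed. It is not the pipeline used in this work: structural aligners also search for the optimal superposition, which is omitted here. The function names, the 0.5 Å floor on the distance scale for short chains, and the cutoff of 0.5 (a commonly used value; the abstract does not state the threshold applied) are assumptions made for illustration.

```python
from typing import Sequence

# Assumed cutoff: 0.5 is a commonly used TM-score threshold for "same fold".
# The abstract does not state which threshold was applied.
SAME_FOLD_THRESHOLD = 0.5


def tm_d0(length: int) -> float:
    """Length-dependent distance scale d0 (angstroms) of the TM-score."""
    if length <= 15:
        # Assumption: floor d0 at 0.5 for very short chains, where the
        # cube-root expression below is not defined for real numbers.
        return 0.5
    return max(1.24 * (length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances: Sequence[float], length: int) -> float:
    """TM-score for aligned residue pairs, normalized by `length`.

    `distances` holds the distance (angstroms) between each aligned pair
    after superposition; unaligned residues contribute nothing.
    """
    d0 = tm_d0(length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length


def likely_same_fold(score: float, threshold: float = SAME_FOLD_THRESHOLD) -> bool:
    """True when the score reaches the assumed structural similarity cutoff."""
    return score >= threshold


if __name__ == "__main__":
    # Hypothetical alignment: 80 of 100 residues aligned, each pair 2 angstroms apart.
    score = tm_score([2.0] * 80, length=100)
    print(f"TM-score = {score:.3f}, same fold: {likely_same_fold(score)}")
```

Because the score is normalized by chain length and each aligned pair contributes at most 1, it stays between 0 and 1, which is what allows a single threshold to be applied across proteins of different sizes.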

Supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Authors and Affiliations

Wilfrid Laurier University, Waterloo, ON, N2L 3C5, Canada
Gabriel Moreno-Hagelsieb

Corresponding author

Correspondence to Gabriel Moreno-Hagelsieb.

Editor information

Editors and Affiliations

CNRS, University of Montpellier, Montpellier, France
Celine Scornavacca
Center for Research and Advanced Studies, Irapuato, Mexico
Maribel Hernández-Rosales

About this paper

Cite this paper

Moreno-Hagelsieb, G. (2024). Transcription Factors Across the Escherichia coli Pangenome: A 3D Perspective. In: Scornavacca, C., Hernández-Rosales, M. (eds) Comparative Genomics. RECOMB-CG 2024. Lecture Notes in Computer Science, vol 14616. Springer, Cham. https://doi.org/10.1007/978-3-031-58072-7_11

DOI: https://doi.org/10.1007/978-3-031-58072-7_11
Published: 15 April 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-58071-0
Online ISBN: 978-3-031-58072-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
