Abstract
Identifying complete sets of transcription factors (TFs) is a foundational step in the inference of genetic regulatory networks. The availability of high-quality predictions of protein three-dimensional (3D) structures has made it possible to use structural similarity to infer homology beyond what is possible from sequence analyses alone. This work explores the potential of such 3D structures for identifying TFs in the Escherichia coli pangenome. Comparisons between predicted structures and their experimental counterparts confirmed the high quality of the predicted structures: most 3D structural alignments had TM-scores well above established structural similarity thresholds, though quality seemed slightly lower for TFs than for other proteins. As expected, structural similarity decreased with sequence similarity, though most TM-scores remained well above the structural similarity threshold, regardless of whether the aligned structures were experimental or predicted. Results at the lowest sequence identity levels revealed the potential for 3D structural comparisons to extend homology inferences below the “twilight zone” of sequence-based methods. The body of predicted 3D structures covered 99.7% of available proteins from the E. coli pangenome, missing only two of those matching TF domain sequence profiles. After potential false positives were removed, structural comparisons increased the number of inferred TFs in the pangenome by 50% over those obtained with sequence profiles alone, which translated into around 20% more TFs per genome.
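As a rough illustration of the TM-score threshold mentioned above, the following minimal sketch evaluates the standard TM-score formula for a superposition that has already been computed. It is not the pipeline used in this work: the function names `tm_score` and `same_fold` are hypothetical, the 0.5 cutoff is the commonly used value for a shared fold and is an assumption here, and a real alignment tool also searches for the superposition that maximizes the score, which this sketch does not attempt.

```python
def tm_score(distances, target_length):
    """Evaluate the TM-score formula for one fixed superposition.

    distances: distances in angstroms between aligned residue pairs.
    target_length: length of the protein used for normalization.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    # Length-dependent distance scale d0. The fixed 0.5 A value for short
    # chains is an assumption of this sketch, used to keep d0 positive.
    if target_length > 21:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    # Each aligned pair contributes between 0 and 1; unaligned residues
    # contribute nothing, so partial alignments are penalized.
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


def same_fold(score, threshold=0.5):
    """Hypothetical helper: True when the score reaches the assumed cutoff."""
    return score >= threshold
```

Under these assumptions, a perfect superposition of all residues scores 1.0, and the score drops as aligned pairs drift apart or as fewer residues are aligned.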
Supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Moreno-Hagelsieb, G. (2024). Transcription Factors Across the Escherichia coli Pangenome: A 3D Perspective. In: Scornavacca, C., Hernández-Rosales, M. (eds) Comparative Genomics. RECOMB-CG 2024. Lecture Notes in Computer Science, vol 14616. Springer, Cham. https://doi.org/10.1007/978-3-031-58072-7_11
DOI: https://doi.org/10.1007/978-3-031-58072-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-58071-0
Online ISBN: 978-3-031-58072-7
eBook Packages: Mathematics and Statistics (R0)