1 Introduction

Enzymes are becoming increasingly valuable for the development of industrial catalysts due to their ability to significantly enhance the rate of biochemical reactions. High efficiency and selectivity are crucial characteristics for choosing enzymes for commercial applications in bio-catalysis, biofuels, and bioremediation. It is not surprising that microbial enzymes make up most commercial enzymes (88%) used in industry, as they offer several advantages over plant- or animal-derived enzymes [1, 2]. These advantages include higher stability, greater production yield, easier optimization, and increased cost-effectiveness in the industrial applications [3, 4]. Despite their advantages, the number of commercially available microbial enzymes is limited. The use of traditional culture-dependent microbiological methods to screen natural diversity for unknown enzymes is a common approach to obtaining appropriate microbial biocatalysts with desirable properties. This approach involves enriching microorganisms from environmental samples in the presence of appropriate substrates, isolating pure cultures, and screening microbial isolates to ultimately identify enzymes of interest [5]. While this method has proven successful in identifying many commercially available enzymes, more than 99% of microorganisms present in environmental samples cannot be cultured using standard laboratory techniques [6]. The consequence of this limitation is a potential loss of microbial diversity and the opportunity to discover novel enzymes with desired catalytic properties.

Advances in next-generation sequencing technologies have made it possible to access the genome sequences of all microorganisms present in an environment, without the need for their isolation and cultivation. The process of subjecting the DNA extracted from a community of microorganisms recovered from an environmental sample to whole-genome shotgun sequencing is referred to as metagenomics [7, 8]. The method enables direct sequencing of environmental DNA (eDNA) to explore community diversity, functional activities, and interactions of microorganisms inhabiting a specific environment [9]. Metagenomic sequences can be assembled de novo into contigs that represent the genomic segments of microorganisms from which they originate. This allows us to access the coding sequences of enzymes from uncultured microorganisms and predict their functional potentials under specific environmental conditions.

This approach can be used to explore the genomic sequences of unknown microorganisms residing in an environment for the discovery of novel enzymes with improved catalytic properties [10, 11]. Its potential is demonstrated by the steadily increasing number of predicted protein-coding sequences from metagenome sequencing of microbial communities obtained from diverse environments. Despite the availability of annotation pipelines, a significant portion of these sequences remains functionally uncharacterized. The challenge is further magnified when the goal is to identify a particular enzyme with an enhanced catalytic property. To tackle this problem, there is growing interest in developing novel computational tools that can model the catalytic properties of enzymes by utilizing shared structural and functional features preserved in their amino acid sequences.

The current experimental approaches for identifying and characterizing new enzymes are limited in terms of speed and throughput, resulting in a gap between the numbers of discovered sequences and enzymes that are experimentally characterized with respect to their catalytic properties [12]. Assaying the activity of these enzyme sequences, especially the large number of novel enzymes predicted from metagenomic sequences, is impractical. Additionally, to realize the industrial application of an enzyme, it needs to be designed to meet specific process requirements [13]. All these limitations highlight the importance of physicochemical and structural features to be considered when searching for enzymes with properties suitable for a specific industrial or biotechnological application. The current approaches for in-silico enzyme discovery rely on the properties of enzymes inferred from phylogenetic analyses, sequence similarity searches, genomic positional information, three-dimensional (3D) structural modeling, and predictions based on machine learning [14]. Phylogenetic analyses can help infer the common ancestral origin of enzymes with shared catalytic properties. The sequence divergence that occurs during natural evolution can introduce variability in catalytic properties. The inclusion of sequences from catalytically efficient enzymes in phylogenetic analyses can help to identify distantly related sequences that may possess novel functional activities. Deep learning models can be employed to predict the structures of target enzymes by utilizing multiple sequence alignments and protein contact maps of many metagenomic sequences [15, 16]. Sequence similarity networks are valuable tools for identifying new candidate protein subfamily clusters by leveraging pairwise sequence similarities [17]. Genomic context provides important information regarding substrates, cofactors, bioactivity, and other co-regulated genes associated with the target enzymes [14]. 
For example, enzymes targeting a specific glycan substrate can be identified based on their genomic localization in polysaccharide utilization loci [18, 19].
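
As a rough illustration of how a sequence similarity network groups candidate sequences, the sketch below links sequences whose pairwise identity meets a threshold and reports the connected components as putative subfamily clusters. The sequence names, identity values, and the 40% cutoff are hypothetical; in practice the scores would come from an all-versus-all alignment.

```python
from collections import defaultdict

def similarity_clusters(pairwise_identity, threshold):
    """Group sequences into connected components of a similarity network.

    pairwise_identity maps (seq_a, seq_b) -> percent identity.
    An edge is drawn when identity >= threshold.
    """
    neighbors = defaultdict(set)
    nodes = set()
    for (a, b), identity in pairwise_identity.items():
        nodes.update((a, b))
        if identity >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    clusters, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical identities (percent) between five metagenomic ORFs.
identities = {
    ("orf1", "orf2"): 62.0,
    ("orf2", "orf3"): 45.0,
    ("orf1", "orf3"): 30.0,
    ("orf4", "orf5"): 71.0,
    ("orf3", "orf4"): 18.0,
}
clusters = similarity_clusters(identities, threshold=40.0)
```

Raising the threshold splits the network into smaller, more closely related clusters, which is how such networks are usually explored when looking for candidate subfamilies.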

Most 3D structures deciphered to date pertain to enzymes isolated from cultured organisms, leaving limited experimental evidence regarding the structural characteristics of enzymes discovered through metagenome sequencing. Predicting protein function from structural data presents a significant challenge due to numerous instances of highly conserved protein folds that catalyze different reactions [14]. However, performing protein structural similarity searches is a crucial step in narrowing down the sequences that encode enzymes of interest from metagenome datasets. This approach aids in gaining functional insights into unknown sequences, reducing the need for costly wet lab experiments. When combined with sequence homology-based searches, it is a powerful means of shortlisting specific enzymes within vast metagenomic datasets. While the 3D structure of enzymes plays a crucial role in their functionality, its utilization in the in-silico bioprospecting of novel enzymes remains limited.

Although several review papers have investigated the application of function-based, homology-based, and machine-learning-based methods to identify, predict, and annotate enzyme-encoding sequences in metagenome data [14, 20,21,22,23,24], there is a lack of knowledge regarding the integration of structural data into annotation pipelines. In this review, we provide an overview of various approaches and pipelines currently employed to explore metagenomic sequences for the discovery of new enzymes. In particular, we will emphasize the significance of incorporating structural data when searching for potential biocatalysts and natural bioproducts in large metagenome datasets.

2 Bioprospecting of novel enzymes from environmental samples

Microbes residing in diverse environments, including soil, hydrothermal vents, saline or alkaline lakes, acid mine drainage, permafrost, hot springs, wastewater treatment sludges, and animal guts, offer the potential for discovering novel enzymatic processes [9]. The microbial communities inhabiting these environments are typically complex in terms of their composition and abundance. It is also worth noting that most members within these communities may not possess desired functions. As a result, comprehensive screening approaches are necessary to identify the desired enzymatic process in a complex environmental sample. Traditional screening approaches involved cultivating microorganisms under defined culture conditions and subsequently screening for microbial clones that exhibit the function of interest. As noted earlier, this approach is unable to capture all microbial diversity in the environment, a phenomenon known as the “great plate count anomaly” [25].

To address the limitations associated with culture-based screening approaches, culture-independent methods were introduced. These methods are classified into two major approaches: functional-based screening (FBS) and sequence-based screening (SBS). Figure 1 provides a comprehensive summary of culture-independent screening approaches commonly used to search for novel biocatalysts in eDNA.

Fig. 1 Culture-independent screening methods for mining novel enzymes from environmental samples. Both function-based and sequence-based methods can benefit from the information gained through structural analysis to refine the initial list of candidate enzymes

2.1 Function-based screening (FBS)

FBS involves direct cloning of eDNA libraries into suitable vectors, followed by functional screening in surrogate hosts such as E. coli [26]. During the screening process, the clones are examined to identify enzymes capable of utilizing a specific substrate or producing a specific product. After identifying the clones with desired functional properties, the DNA encoding for the function of interest is sequenced to identify the gene responsible for observed enzymatic activity. It is important to note that while this method has the potential to screen thousands of eDNA libraries, there are several limitations. The method becomes labor-intensive due to the necessity of analyzing a large number of clones to encompass the entire range of microorganisms present in an environmental sample [27]. The lack of expression of DNA originating from distantly related microorganisms in the surrogate host may result in a reduced representation of significant diversity in the screened samples. In addition, if the desired function relies on the coordinated activity of multiple enzymes, it is essential for all the encoding genes clustered in the same genomic region to be recovered in a single clone [28].

One significant advantage of FBS is its independence from prior knowledge of gene sequences or even the existence of such enzymes. Nevertheless, screening a typical metagenome library necessitates the evaluation of a substantial number of clones. As the complexity of the library increases, this process becomes labor-intensive, time-consuming, and expensive. To expedite the screening process, robotic instruments have been developed, allowing for efficient processing of complex eDNA libraries at a rate of up to 10 million per day. This method enables assaying for a single substrate, thereby reducing the chance of identifying highly promiscuous and multi-functional enzymes [20]. The success of identifying a target enzyme depends on several factors, including the assay method, gene size, gene abundance in the metagenomic sample, host-vector system, and the efficiency of gene expression in the surrogate host [26, 29].

FBS has been widely used for screening novel enzymes, including cellulase [30], esterase [31, 32], carboxylesterase [33], and lipases [34] from diverse environmental sources. In a recent study using activity-based screening through complementary sequence and structure analyses, a novel esterase was isolated by investigating lipolytic enzymes from a compost metagenome library [35]. The same approach was used to identify four thermo-alkaliphilic glycosyl hydrolases from wheat straw-degrading microbial consortia [36]. These enzymes hold the potential for utilization in lignocellulosic biomass-degrading cocktails.

2.2 Sequence-based screening (SBS)

The conventional methods for enzyme discovery are generally laborious, costly, resource-intensive, and time-consuming, with no guarantee of success. Due to the availability of a vast number of manually curated protein sequences as well as experimentally characterized enzymes in public databases, the development of novel computational approaches has become imperative to leverage this information in the enzyme discovery process [2, 24]. Specifically, these valuable resources can be utilized to construct machine-learning models that can aid in biocatalyst prospecting. SBS methods expedite the discovery of novel enzymes while minimizing resource usage and achieving a higher success rate.

Prior to the emergence of metagenome sequencing, SBS relied predominantly on the design of primers or probes derived from conserved regions of known enzymes to amplify or screen eDNA libraries in the quest for novel enzyme sequences. This method allows for the identification of novel candidate variants of known enzyme sequences but does not possess the capability to discover entirely new enzymes [27]. Metagenome sequencing has revolutionized the field by enabling the sequencing of complete DNA extracted from a specific environmental sample [37]. Considering this capability, we can now delve into the genetic constituent of every microorganism present in any environment and gain access to all coding sequences, enabling the exploration of any enzymes. The primary challenge associated with this approach lies in accurately annotating the coding sequences predicted in metagenomic sequences. Presently, the annotation process relies on sequence homology searches against known genes or pathways available in public databases. However, the process lacks optimal efficiency, with over 40% of protein-coding sequences remaining unannotated and labeled as unknown or hypothetical. The situation becomes more complex when searching for an enzyme that catalyzes a specific hydrolytic or biosynthetic reaction within a vast number of protein-coding sequences predicted in a metagenomic dataset. The search for a new enzyme through bioprospecting of metagenomic sequences can be carried out by using two general approaches: de novo and reference-based, depending on the availability of known enzyme families [14]. The de novo discovery of new biocatalysts using SBS is challenging, particularly when there is no prior knowledge about the function of interest. Recent studies suggest that predicting protein structures and comparing structural models using residue-residue contact maps can be used to model unknown structures and assist in identifying new biocatalysts in metagenomic datasets [38, 39]. 
Reference-based methods can be employed when there is existing knowledge about the members of a specific class of enzymes, but the search is for enzymes with distinct functionality. Identifying new enzymes may be less challenging when there are experimentally characterized members, compared to situations where there is a lack of prior knowledge about enzyme function and structure. Robinson et al. [14] proposed a roadmap for metagenomic enzyme discovery, termed “enzyme expansion”, which aims to discover enzymes with novel catalysts, substrate specificities, and reaction conditions. Considering that both the reference-based and “enzyme expansion” methods aim to identify enzymes with novel catalytic properties, we have integrated them as a reference-based approach. It is clear that de novo and reference-based methods of enzyme discovery can effectively leverage homology-based (HB), structural-based (SB), and machine learning-based (MLB) analyses.

2.2.1 Homology-based (HB) analysis

Sequence homology search can be used to identify sequences that are closely or distantly related to known enzymes. The method holds significant potential for discovering novel functional homologs of known enzymes [2]. The approach is not only effective in expanding homologs of known enzymes but also capable of searching for enzymes with unique functions. The search is typically conducted on the sequences deposited in public databases, including Pfam, RefSeq, UniProt, and NCBI non-redundant protein (NCBI-nr). Due to inadequate annotations in most publicly available databases, the search results may include hits that are incorrectly annotated [15]; thus, the results must be manually curated. Several tools have been developed to facilitate the search for closely related sequences, including BLAST [40], DIAMOND [41], and USEARCH [42]. Search algorithms based on either profile Hidden Markov Models (HMMs) such as HMMER [43] or position-specific scoring matrices such as PSI-BLAST [44] can be utilized to identify distantly related sequences. In addition, automated annotation platforms such as MetaHMM [45] and ANASTASIA [46] have been developed to facilitate enzyme discovery through homology-based search. The success of the method hinges on selecting the appropriate target database for the homology search and ensuring the accuracy of annotations for the sequences within that database.
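
A homology search typically ends with a filtering step over the reported hits. The following minimal sketch assumes the 12-column tabular format written by BLAST+ with -outfmt 6 and keeps, for each query, the highest-scoring hit that passes an E-value and an identity cutoff. The cutoff values, query names, and subject names are illustrative assumptions, not recommendations from the studies cited above.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(tabular_text, max_evalue=1e-5, min_identity=30.0):
    """Return the highest-bitscore hit per query that passes both cutoffs."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        identity = float(row["pident"])
        bitscore = float(row["bitscore"])
        if evalue > max_evalue or identity < min_identity:
            continue  # hit is too weak under the assumed cutoffs
        current = best.get(row["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[row["qseqid"]] = {
                "subject": row["sseqid"],
                "identity": identity,
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best

# Hypothetical hits for three predicted ORFs against a reference database.
example = "\n".join([
    "orf_001\trefA\t48.2\t250\t120\t4\t1\t248\t5\t252\t3e-60\t210.0",
    "orf_001\trefB\t35.0\t240\t150\t6\t3\t240\t1\t238\t1e-20\t95.0",
    "orf_002\trefC\t25.0\t180\t130\t5\t1\t178\t2\t180\t2e-08\t55.0",
    "orf_003\trefD\t41.0\t60\t35\t0\t1\t60\t10\t69\t0.5\t30.0",
])
hits = best_hits(example)
```

Hits retained by such a filter still inherit any annotation errors present in the target database, so manual curation remains necessary.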

The sequence homology search can be utilized to narrow down a large set of ORFs predicted in a complex metagenome dataset, specifically focusing on enzymes with unique functional characteristics, including thermostability, pH stability, specific activity, and more [47]. In a study by Elbehery et al. [48], HB analysis was carried out to identify two antibiotic resistance genes from the metagenome of Atlantis II Deep Red Sea brine pool. Protein-coding sequences were annotated against sequences deposited in the Comprehensive Antibiotic Resistance Database (CARD, https://card.mcmaster.ca/) using BLASTx, leading to the successful identification of two ORFs encoding a class A beta-lactamase and an aminoglycoside-3’ phosphotransferase. The properties of these enzymes were further elucidated through 3D structure prediction. Garg et al. [49] applied HB analysis to identify a novel cellulase (Cel5R) from a soil metagenome. The enzyme was subsequently characterized for its salt- and heat-stable properties. The 3D structure of the enzyme was determined through crystallography. In the landmark study leading to the development of the ANASTASIA platform, a novel esterase named EstDZ4 was mined in a hot spring metagenome [46]. The HB analysis proved successful in identifying EstDZ4, which showed thermostable properties, making it a promising candidate for biotechnological application. The result of this study demonstrated the efficacy of in-silico analyses in identifying enzymes that exhibit remote similarity to known sequences.

2.2.2 Machine learning-based (MLB) analysis

In HB analysis, it is assumed that homologous sequences share similar functions. However, it is important to acknowledge that there can be exceptions to this rule, where two closely related sequences may possess different functions. Consequently, relying solely on sequence homology may lead to wrongly interpreted or overlooked functional variations. To address these limitations, additional analyses and experimental validations are often necessary to accurately determine functional attributes of closely related sequences. To incorporate additional features in function prediction, methods that leverage MLB analysis can be employed. MLB analysis utilizes advanced algorithms and models to learn patterns and relationships from various data sources, including sequence information, structural properties, physicochemical characteristics, and functional annotations [50, 51]. By considering a broader range of features, MLB analysis can enhance the accuracy and specificity of function prediction, enabling the identification of enzymes with unique and diverse functional characteristics. Moreover, MLB algorithms can detect non-linear relationships and patterns in the data, increasing the likelihood of discovering novel enzymes compared to HB analysis. MLB analysis has demonstrated its effectiveness in uncovering hidden functional relationships and facilitating the discovery of novel biocatalysts with specific catalytic activities and desirable properties.

Several MLB approaches have been developed for the functional classification of the enzymes. Table 1 lists some of the methods that utilize MLB models for the annotation of protein sequences and the prediction of EC numbers. It is important to note that while the methods presented in Table 1 primarily focus on identifying mono-functional enzymes, there are specialized tools such as mlDEEPre [52] that enable the prediction of both multi-functional and mono-functional enzymes.

Table 1 Machine learning algorithms that are used for EC number prediction (all methods are accessible through a web server)

2.2.2.1 EzyPred

EzyPred takes a protein sequence as input and then determines whether it is an enzyme. It then proceeds to classify the enzyme into its respective EC number, main EC class, and subclass. The classification of protein sequences in EzyPred is achieved through the implementation of a machine learning approach known as "optimized evidence-theoretic k-nearest neighbor (OET-KNN)" in conjunction with two types of features to capture information about the protein sequence [53].

2.2.2.2 SVM-prot

SVM-prot was initially developed as a computational tool for predicting the EC number of enzymes. It utilizes a representation of the protein sequence using 13 different numerical properties. It employs composition, transition, and distribution to encode each property. The original version used support vector machines (SVM) as the classifier, while it was later updated to utilize two additional classifiers, namely K-nearest neighbors (KNN) and probabilistic neural networks, to expand its prediction capabilities [54]. The incorporation of newer classifiers has significantly improved the overall performance of the method in predicting the EC number of enzymes and their functionality.
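
To make the composition, transition, and distribution (CTD) encoding concrete, the sketch below computes the three descriptor groups for a single property. It assumes a commonly used three-class hydrophobicity grouping of amino acids; the exact property tables and implementation details used by SVM-prot may differ.

```python
import math

# Three-class hydrophobicity grouping often used for CTD descriptors
# (an assumption for illustration; SVM-prot's own tables may differ).
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
CLASS_OF = {aa: label for label, members in GROUPS.items() for aa in members}

def ctd_descriptors(sequence):
    """Return (composition, transition, distribution) for one property."""
    encoded = [CLASS_OF[aa] for aa in sequence.upper() if aa in CLASS_OF]
    n = len(encoded)
    if n < 2:
        raise ValueError("need at least two standard residues")

    # Composition: fraction of residues that fall in each class.
    composition = {c: encoded.count(c) / n for c in "123"}

    # Transition: frequency of adjacent residue pairs that switch class.
    transition = {"12": 0, "13": 0, "23": 0}
    for left, right in zip(encoded, encoded[1:]):
        if left != right:
            transition["".join(sorted((left, right)))] += 1
    transition = {pair: count / (n - 1) for pair, count in transition.items()}

    # Distribution: position (percent of length) of the first, 25%, 50%,
    # 75%, and last occurrence of each class along the sequence.
    distribution = {}
    for c in "123":
        positions = [i + 1 for i, label in enumerate(encoded) if label == c]
        if not positions:
            distribution[c] = [0.0] * 5
            continue
        total = len(positions)
        marks = [1] + [max(1, math.floor(total * f))
                       for f in (0.25, 0.5, 0.75, 1.0)]
        distribution[c] = [100.0 * positions[m - 1] / n for m in marks]
    return composition, transition, distribution

composition, transition, distribution = ctd_descriptors("RKCLGA")
```

In this sketch each property contributes 3 + 3 + 15 = 21 values, so the feature vector has a fixed size regardless of sequence length.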

2.2.2.3 DEEPre

DEEPre is an EC number prediction tool that employs two types of features for mapping a protein sequence into a numerical space [55]. Sequence length-dependent features, such as position-specific scoring matrices (PSSM), and sequence length-independent features, such as functional domain-based encoding are used as input to a deep learning model comprised of a convolutional neural network (CNN) and recurrent neural network (RNN). DEEPre can predict enzyme function on all four levels of the EC classification system.
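
The notion of a sequence length-dependent feature can be illustrated with a simplified position-specific scoring matrix. The sketch below builds log-odds scores from a toy alignment using a uniform background and a fixed pseudocount; both are simplifying assumptions, and the matrices produced by PSI-BLAST rely on more elaborate sequence weighting.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def simple_pssm(aligned_sequences, pseudocount=1.0):
    """Log-odds matrix with one row per alignment column and 20 scores per row."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    matrix = []
    for column in zip(*aligned_sequences):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for residue in column:
            if residue in counts:  # gaps and ambiguous symbols are skipped
                counts[residue] += 1
        total = sum(counts.values())
        matrix.append([math.log2(counts[aa] / total / background)
                       for aa in AMINO_ACIDS])
    return matrix

# Hypothetical four-sequence alignment; "-" marks a gap.
alignment = ["MKLV", "MRLV", "MKIV", "MK-V"]
pssm = simple_pssm(alignment)
```

The number of rows grows with the length of the alignment, which is why such features are called length-dependent, whereas a domain-based encoding has the same size for every protein.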

2.2.2.4 ECPred

ECPred is another popular method for predicting the EC number of enzymes [56]. This method adopts an independent learning model for each EC number. The classification is carried out in two levels. In the first level, features based on PSSM and physicochemical properties are utilized, and an SVM classifier is employed. In the second level, features derived from sequence alignments are used for classification by a Nearest Neighbor (NN) classifier.

2.2.2.5 CLEAN

CLEAN is an ML algorithm that assigns EC numbers to less-studied proteins or those with uncharacterized functions [57]. CLEAN utilizes a contrastive learning framework, enabling it to confidently assign EC numbers to understudied enzymes, correct mislabeled enzymes, and identify promiscuous enzymes with multiple EC numbers. The effectiveness of CLEAN has been demonstrated through systematic in silico and in vitro experiments.
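
Once a contrastive model has placed functionally similar sequences close together in an embedding space, EC numbers can be assigned by distance. The sketch below shows only this assignment step under simplifying assumptions: the two-dimensional embeddings are made up, each EC number is represented by the mean of its members, and the contrastive training itself is not reproduced. The function names are hypothetical and do not correspond to the CLEAN code base.

```python
import math

def cluster_centers(labelled_embeddings):
    """Average the embeddings of all sequences that share an EC number."""
    sums, counts = {}, {}
    for ec, vector in labelled_embeddings:
        acc = sums.setdefault(ec, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[ec] = counts.get(ec, 0) + 1
    return {ec: [value / counts[ec] for value in acc]
            for ec, acc in sums.items()}

def rank_ec_numbers(query, centers):
    """Sort EC numbers by Euclidean distance between query and cluster center."""
    return sorted((math.dist(query, center), ec)
                  for ec, center in centers.items())

# Made-up 2-D embeddings for two EC numbers (real embeddings are much larger).
training = [
    ("3.1.1.1", [0.9, 0.1]),
    ("3.1.1.1", [1.1, -0.1]),
    ("3.2.1.4", [-1.0, 1.0]),
    ("3.2.1.4", [-0.8, 1.2]),
]
centers = cluster_centers(training)
ranking = rank_ec_numbers([0.8, 0.2], centers)
```

Reporting more than one EC number when several centers lie at comparable distances is one way a distance-based scheme can accommodate promiscuous enzymes.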

2.2.2.6 HDMLF

HDMLF is a novel hierarchical dual-core multitask learning framework utilizing advanced deep learning techniques for protein sequence embedding and EC number prediction [58]. An attention layer and a greedy strategy optimize the EC prediction process, resulting in stable and superior performance compared to other representative methods. The tool is accessible through the user-friendly web platform ECRECer (https://ecrecer.biodesign.ac.cn) with a cloud-based serverless architecture and an offline package to enhance usability.

2.2.2.7 EnzBert

EnzBert is a transformer model for sequence-based protein functional annotation [59]. It predicts the functional enzyme annotations by taking into account only sequence features. When compared to state-of-the-art tools, this model demonstrates superior performance in predicting EC numbers. Specifically, the EnzBert model significantly enhanced accuracy in monofunctional enzyme class prediction and achieved a notable improvement in EC number predictions at level two within the benchmark dataset.

2.2.3 The integrative approaches based on homology and machine learning

Both HB and MLB approaches can be used to discover novel microbial enzymes from environmental samples. Integrating HB and MLB methods increases the accuracy of enzyme discovery and allows for the targeted mining of novel enzymes, thereby reducing the need for costly and time-consuming wet lab experiments. In a previous study, thermostable xylanases were identified by the HB method and further analyzed using an ML-aided approach [60]. Specifically, the authors developed an ML model called TAXyl, based on an SVM, which was trained using various sequence-based and length-independent protein features. The model was designed to discriminate between sequences encoding non-thermophilic, thermophilic, and hyper-thermophilic xylanases. The model was successfully applied to predict three novel thermostable xylanases from sheep and cow rumen metagenomes.
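
Amino acid composition is one simple example of a length-independent, sequence-based feature of the kind that can be fed to such a classifier. The sketch below is illustrative only and is not the TAXyl feature set; the toy sequences are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(sequence):
    """Return a 20-value vector of amino acid fractions for one sequence."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]

# Two toy sequences of very different lengths yield vectors of the same size.
short_features = composition_features("MKKLE")
long_features = composition_features("MKKLE" * 40)
```

Because every sequence maps to a vector of the same size, proteins of different lengths can be compared directly by a classifier such as an SVM.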

Furthermore, the same group developed an integrated tool called MCIC, which combines HB and MLB analyses to identify cellulases from metagenomic sequences [61]. MCIC focuses on screening novel cellulases based on their optimal pH and temperature dependencies. The machine learning model employed in MCIC was trained using various sequence-based features. The tool facilitates the comparison of metagenome datasets based on their cellulolytic capabilities. To validate the method, two candidate cellulase enzymes identified by MCIC were cloned and subjected to further characterization.

MeTarEnz (metagenomic targeted enzyme miner) (https://cbb.ut.ac.ir/MeTarEnz/) is a similar software providing various services for targeted isolation of different enzymes from user-defined databases. It accepts sequences in different formats including unassembled short reads, assembled contigs, and translated coding sequences. This software can also predict the optimum pH and temperature of lipolytic enzymes using regression models. It was implemented for an in-depth analysis of tannery wastewater metagenomic data followed by mining a thermophilic alkaline lipase [62].

2.3 Utilization of structural information

The primary goal of bioprospecting enzymes for many industrial applications is to identify those that exhibit optimal functionality under specific conditions. Overcoming obstacles and addressing challenges associated with screening methods will contribute to the development of novel tools and technologies for enzyme discovery through metagenomic analysis. By doing so, we can enhance the efficiency and effectiveness of the bioprospecting process, leading to the identification of enzymes with desired characteristics for various industrial applications. Both SBS and FBS methods generate extensive lists of candidate enzymes. However, characterizing these candidates and identifying specific enzymes with desired properties remains a challenging task. Structural analyses can play a crucial role in narrowing down the search space by reducing the candidate sequences to a limited subset. This targeted subset can then undergo further functional analysis through wet lab procedures. By integrating structural analyses, researchers can efficiently prioritize and focus their experimental efforts on a more manageable set of candidate enzymes, facilitating the identification of enzymes with the desired properties.

It is widely accepted that the 3D structure of an enzyme directly influences its function. However, there are also instances where proteins with similar sequences exhibit dissimilar structures [63]. Surprisingly, even highly similar sequences can lead to proteins with distinct structures. This observed structural dissimilarity often correlates with differences in their functions [63]. There are also examples of proteins with limited sequence similarities but the same folding structures, suggesting that conserved positions in proteins tend to preserve their folding and biological functionality [64]. These findings highlight the complex relationship between protein sequence, structure, and function, demonstrating that sequence similarity alone cannot reliably predict structural similarities or functional properties of enzymes.

The analysis of protein structure–function relationships can be conducted at three levels: amino acid sequence and composition, 3D structure, and spatial conformations of the active site [65]. Computational molecular simulation offers a robust approach for determining and analyzing enzyme structure, dynamics, and functional mechanisms within the framework of physical interactions. Analyzing the 3D structures of enzymes can provide valuable insights into their diverse properties, such as function, spatial conformation, thermal and pH stability.

Prominent methods for predicting 3D protein structures include comparative modeling and ab initio structure prediction [66]. Comparative modeling can be achieved through homology modeling or threading methods for fold recognition. In homology modeling, predictions are based on previously solved structures serving as templates, assuming that homologous proteins share similar 3D structures. Choosing an appropriate template model is crucial for achieving high-quality and accurate predictions. Threading methods involve scanning the primary structure of an unknown protein against a database of proteins with known structures [67, 68]. By employing scoring functions based on statistical or knowledge-based potentials, the compatibility of the query protein with known structure is evaluated. Commonly used tools for comparative modeling include I-TASSER [69], Phyre [70], MODELLER [71], SWISS-MODEL [72], and AlphaFold [73]. Particularly, AlphaFold represents a significant advancement in structure prediction methodologies, leveraging state-of-the-art neural network architectures and training procedures. By integrating evolutionary, physical, and geometric constraints specific to protein structures, AlphaFold achieves remarkable improvements in accuracy.

Ab initio protein structure modeling involves the prediction of protein structures from scratch, relying solely on physical forces and energy principles [74]. This approach is particularly valuable when experimental structural information or suitable template structures are unavailable. Various tools are available to perform ab initio structural prediction, each utilizing different algorithms and methodologies. Notable examples include GROMACS [75], NAMD [76], and TeraChem [77]. These tools employ advanced simulation techniques such as molecular dynamics to explore the conformational space and identify the most energetically favorable protein structure. By leveraging the principles of physics and energy minimization, ab initio modeling enables the generation of protein structures in the absence of prior structural knowledge.

Protein 3D structure modeling plays a crucial role in distinguishing proteins with similar sequences, allowing the exploration of hidden characteristics that cannot be revealed through conventional sequence homology searches alone. This capability becomes particularly valuable when searching for novel enzymes within protein sequences predicted from metagenome data. By providing detailed insights into the spatial arrangement of atoms within a protein, 3D structure modeling aids in the identification of unique structural features, functional regions, and key residues that contribute to enzyme activity. This deeper understanding of protein structure allows for more precise and comprehensive analysis, ultimately facilitating the discovery and characterization of novel enzymes with desired properties from metagenome-derived sequences.

Structural information has been employed in enzyme bioprospecting from environmental samples, as demonstrated by the studies summarized in Table 2. The processes that lead to the identification of candidate enzymes are grouped into six distinct stages (S1-S6), with each stage involving specific computational analyses. The different stages of enzyme bioprospecting and their corresponding computational analyses are presented in Table 2.

Table 2 The list of candidate enzymes discovered through integrated sequence and structure analyses

2.3.1 Predicting enzyme thermal stability through structural analysis

The 3D structure of native proteins is determined by a multitude of weak interactions, including hydrogen bonds, salt bridges, and hydrophobic and polar interactions. These non-covalent forces, along with covalent disulfide bonds between cysteine residues, play essential roles in stabilizing protein structure [78]. They contribute to various properties such as protein stability, dynamics, recognition, catalysis, and degradation. Salt bridges are strong electrostatic interactions formed between oppositely charged groups [79] that stabilize protein structure and protect the protein from aggregation [80]. The stability of salt bridges is influenced by factors such as pH and the distance and geometric orientation of the residues involved. Predicting the presence and location of salt bridges in a protein provides valuable insight into protein stability. There are several freely available tools to predict salt bridges, including Tm predictor [http://tm.life.nthu.edu.tw/], PoPMusic [81], and SCooP [82]. These tools are mainly used to predict changes in the thermodynamic stability, melting temperature, and temperature-dependent stability of a protein.
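
As a rough illustration of how the distance dependence described above can be used to locate salt bridges in a modeled structure, the following Python sketch flags acidic and basic side-chain pairs whose charged atoms lie within a cutoff. The `Atom` record, the function name `find_salt_bridges`, and the 4.0 Å cutoff are assumptions made for this example (histidine and geometric orientation are ignored for simplicity); it is not the method implemented by the tools cited above.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    chain: str
    res_name: str
    res_id: int
    atom_name: str
    x: float
    y: float
    z: float


# Side-chain atoms that carry the charge in Asp/Glu and Lys/Arg.
ACIDIC = {("ASP", "OD1"), ("ASP", "OD2"), ("GLU", "OE1"), ("GLU", "OE2")}
BASIC = {("LYS", "NZ"), ("ARG", "NE"), ("ARG", "NH1"), ("ARG", "NH2")}


def find_salt_bridges(atoms, cutoff=4.0):
    """Return sorted (acidic residue, basic residue) pairs within the cutoff (in Å)."""
    acids = [a for a in atoms if (a.res_name, a.atom_name) in ACIDIC]
    bases = [a for a in atoms if (a.res_name, a.atom_name) in BASIC]
    pairs = set()
    for a in acids:
        for b in bases:
            distance = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
            if distance <= cutoff:
                pairs.add(((a.chain, a.res_name, a.res_id),
                           (b.chain, b.res_name, b.res_id)))
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical coordinates, for demonstration only.
    demo = [
        Atom("A", "ASP", 10, "OD1", 0.0, 0.0, 0.0),
        Atom("A", "LYS", 50, "NZ", 3.0, 0.0, 0.0),
        Atom("A", "GLU", 80, "OE1", 20.0, 0.0, 0.0),
    ]
    for acid, base in find_salt_bridges(demo):
        print(acid, "<->", base)
```

Counting such pairs across homologous candidates gives a simple, structure-derived feature that can be compared alongside predicted melting temperatures.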

Hydrogen bonds are another crucial type of interaction that contributes to protein structure. They play a key role in the formation of secondary structures, such as α-helices and β-sheets, by linking the backbone carbonyl oxygen of one residue to the amide N-H group of another [83]. Several tools are available for predicting the number of hydrogen bonds in a protein, including HBPLUS [84], PyMOL [85], and HAAD [86].

Disulfide bonds also play a vital role in the formation of protein structures. They contribute to the stability of protein structures under harsh environments, enhance their mechanical and thermodynamic stability, and minimize the likelihood of misfolding [87]. Computational tools have been developed to accurately predict disulfide-bonding networks and patterns in a protein, thereby aiding in the correct modeling of protein structure. Fariselli et al. [78] introduced a tool for predicting the disulfide bonding state of cysteines in proteins with a prediction accuracy of over 90%.

3 Natural product discovery through metagenomics

Traditionally, the search for bioactive natural products in microorganisms relied largely on activity-based screening approaches [88], which in turn require the isolation and pure culture of the source microorganism. However, recent advances in culture-independent metagenomic and bioinformatic analyses have made it possible to search for novel natural products in microorganisms without the need for pure cultures. This approach also makes it possible to investigate the enzymatic mechanisms involved in the biosynthesis and modification of these medicinally important natural compounds. Despite the structural complexity of natural products, their biosynthetic pathways and the enzymes involved in their bioconversion exhibit a remarkable degree of conservation across diverse microbial lineages [23]. This conservation facilitates the discovery, annotation, and characterization of novel natural product biosynthetic enzymes and pathways through sequence homology searches and structural predictions [89]. The combined application of advanced bioinformatic tools and high-throughput screening methodologies offers a powerful approach for targeted mining of metagenomic data, with the potential to significantly accelerate the discovery of novel natural product biosynthesis pathways and the subsequent characterization of valuable therapeutic agents and bioactive compounds.

The majority of bacterial natural products are secondary metabolites encoded by conserved biosynthetic gene clusters (BGCs), groups of two or more closely linked genes that encode the enzymes of the biosynthetic pathway for a specific metabolite or natural product [90]. This genomic organization facilitates the identification of natural products through genome mining approaches. Genome mining tools such as antiSMASH [91], PRISM [92], CLUSEAN [93], NP.searcher [94], and NRPminer [95] have been developed to identify putative BGCs in genome or metagenome datasets. antiSMASH stands out among these tools by offering a comprehensive suite of tools and databases for automated genome mining of a wide array of secondary metabolites. By combining genome mining for BGCs with chemical structure prediction for the encoded secondary metabolites, PRISM significantly improves the detection of genetically encoded nonribosomal peptides and polyketides [92]. While these tools facilitate the identification of genomic loci responsible for natural product biosynthesis, connecting these loci to the specific chemical structures of the encoded products remains challenging [96]. Genomic analysis has revealed that bacterial genomes house numerous orphan BGCs, which are clusters not yet associated with the natural products they encode. Conversely, there are numerous examples of isolated natural products that have not been linked to their corresponding BGCs [97]. ML approaches have shown potential in genome mining for natural products, predicting the structure of natural products, and inferring biological activity from BGCs or from the chemical structure of the respective secondary metabolite. Recently, Prihoda et al. [98] showed that ML can be used in several steps to find bioactive natural products in genome sequences, including genome annotation, feature representation, BGC detection, structure prediction, and activity profiling. Another study developed a comprehensive ML method to predict the structures and biological activity of secondary metabolites from microbial genome sequences [99]. This approach can be used to predict the structures of natural products encoded by orphan BGCs.
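
The definition of a BGC as two or more closely linked biosynthetic genes suggests a simple proximity heuristic, sketched below in Python. The sketch groups genes that have already been flagged as biosynthetic (for example by a prior domain search) into candidate clusters when neighboring hits lie within a maximum gap on the same contig. The `Gene` record, the function name `candidate_clusters`, and the 10 kb default gap are hypothetical choices for illustration; the sketch does not reproduce the rule sets or models used by antiSMASH, PRISM, or the other tools named above.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    start: int          # start coordinate on the contig (bp)
    end: int            # end coordinate on the contig (bp)
    biosynthetic: bool  # True if annotated with a biosynthetic domain


def candidate_clusters(genes, max_gap=10_000, min_genes=2):
    """Group biosynthetic genes separated by at most max_gap bp into clusters."""
    hits = sorted((g for g in genes if g.biosynthetic), key=lambda g: g.start)
    clusters, current = [], []
    for gene in hits:
        # Close the running cluster when the next hit is too far away.
        if current and gene.start - current[-1].end > max_gap:
            if len(current) >= min_genes:
                clusters.append(current)
            current = []
        current.append(gene)
    if len(current) >= min_genes:
        clusters.append(current)
    return [[g.gene_id for g in cluster] for cluster in clusters]


if __name__ == "__main__":
    # Hypothetical contig annotation, for demonstration only.
    contig = [
        Gene("g1", 0, 1_000, True),
        Gene("g2", 1_500, 3_000, True),
        Gene("g3", 3_200, 4_000, False),
        Gene("g4", 4_500, 6_000, True),
        Gene("g5", 50_000, 51_000, True),
    ]
    print(candidate_clusters(contig))
```

In this example the isolated hit `g5` is discarded because a single gene does not satisfy the two-gene minimum. Fragmented metagenomic contigs can split a true cluster across assemblies, which is one reason dedicated tools go well beyond this kind of proximity rule.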

Widespread metagenomic exploration of diverse microbial niches has made vast amounts of genomic data available. These data hold immense potential for bioprospecting, offering novel microbial secondary metabolites with a spectrum of promising medical and biotechnological applications. Numerous attempts have been made to explore metagenome data for novel natural products. In a study by Nayfach et al. [9], over 100,000 BGCs were predicted in 52,515 metagenome-assembled genomes cataloged from diverse microbial communities representing the Earth's microbiome. This antiSMASH-based BGC discovery yielded up to 54 times more BGCs than the manually curated entries in the MIBiG dataset, highlighting a vast reservoir of unexplored microbial natural products. In a comprehensive computational and experimental study, a probabilistic algorithm named MetaBGC was developed and applied to identify potential BGCs in complex metagenomic sequences from various regions of the human microbiome (gut, mouth, skin, and vagina) [100]. Of the 13 BGCs encoding type II polyketides, two were successfully cloned and expressed in a heterologous system, revealing potent antibacterial activities against gut microbes and suggesting a potential role in microbial interactions within the gut environment. These findings underscore the need for advanced tools and pipelines for targeted mining of metagenomes for novel microbial secondary metabolites with biotechnological and medicinal potential.

4 Future directions

An extensive literature review highlights that functional screening is, in fact, a major source of currently characterized enzymes from environmental samples. However, there are instances where the integration of FBS and SBS methods has proven to be successful. For example, in a study on the pre-screening of clone libraries using functional screening followed by insert sequencing, a remarkable 106-fold increase in the success rate was achieved in identifying genes encoding desired enzymes compared to direct sequencing approaches [101]. Both FBS and SBS methods offer distinct advantages and disadvantages. SBS approaches may have limitations in terms of sequencing cost and errors. Furthermore, uncertainty in functional annotations and their limitations in discovering novel enzymes pose challenges to their widespread applications. FBS approaches can be used to identify novel enzymes and facilitate the direct determination of gene functions. However, the FBS methods also suffer from higher costs, the lack of effective screening methods for certain enzyme activities, and the challenges associated with heterologous expression systems.

In the past decade, significant improvements have been made in the computational modeling of 3D structures of proteins. These advancements have made it possible to take advantage of protein structure modeling in screening for novel enzymes from metagenomic sequences. Structural modeling can be used to evaluate enzymes for substrate specificity, enantioselectivity, metal ion specificity, pH and temperature dependence, as well as stability and secondary catalytic function.

In an era of rapid expansion of enzyme-related biological databases, which serve as repositories for genome sequences, enzymes, tertiary structures, active sites, and metabolic pathways and reactions, there is an increased demand for the development of functional and computational screening tools. The integration of SBS and FBS methods, coupled with structural modeling, paves the way toward efficient exploration of novel enzymes from high-throughput metagenomic data. This combination of approaches presents a promising roadmap for effective enzyme and natural product mining in the future.