Data mining of metagenomes to find novel enzymes: a non-computationally intensive method


There is currently a need for bioinformatics tools that are not computationally intensive to cope with the growing volume of large datasets produced by Next Generation Sequencing technologies. We present a simple and robust bioinformatics pipeline to search for novel enzymes in metagenomic sequences. The strategy is based on pattern searching, using conserved motifs coded as regular expressions as the reference. As a case study, we applied this scheme to search for novel S8A proteases in a publicly available metagenome. Briefly, (1) the metagenome was assembled and translated into amino acids; (2) patterns were matched using regular expressions; (3) retrieved sequences were annotated; and (4) diversity analyses were conducted. Following this pipeline, we identified nine sequences containing an S8 catalytic triad, starting from a metagenome containing 9,921,136 Illumina reads. The identity of these nine sequences was confirmed by BLASTp against databases at NCBI and MEROPS. Identities to their respective nearest orthologs ranged from 62 to 89%; these orthologs belonged to the phyla Proteobacteria, Actinobacteria, Planctomycetes, Bacteroidetes, and Cyanobacteria, consistent with the most abundant phyla reported for this metagenome. Together, these results support the idea that all nine are novel S8 sequences and strongly suggest that our methodology is robust and suitable for detecting novel enzymes.
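Steps (1) and (2) of the pipeline, translation into amino acids followed by regular-expression matching, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The three motif expressions below are simplified, hypothetical stand-ins for the conserved regions around the Asp, His, and Ser residues of an S8 catalytic triad, and the names `find_hits`, `build_triad_regex`, and the `max_gap` parameter are assumptions introduced only for this example. Assembly is assumed to have been done already, so the input is a dictionary of contig sequences.

```python
import re

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Simplified, illustrative motifs (assumptions, not the study's expressions).
ASP_MOTIF = r"D[ST]G"        # region around the catalytic Asp
HIS_MOTIF = r"HG[ST]"        # region around the catalytic His
SER_MOTIF = r"GTS.[AS].P"    # region around the catalytic Ser


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Yield the six reading-frame translations (frames 0-2 forward, 3-5 reverse)."""
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            yield translate(strand[offset:])


def build_triad_regex(max_gap=300):
    """Compile one expression requiring the motifs in Asp, His, Ser order.

    The gaps exclude '*' so that a match cannot span a stop codon.
    """
    gap = f"[^*]{{0,{max_gap}}}?"
    return re.compile(ASP_MOTIF + gap + HIS_MOTIF + gap + SER_MOTIF)


def find_hits(contigs, max_gap=300):
    """Return (contig_id, frame, matched_peptide) for every triad match."""
    triad = build_triad_regex(max_gap)
    hits = []
    for contig_id, dna in contigs.items():
        for frame, protein in enumerate(six_frames(dna)):
            for match in triad.finditer(protein):
                hits.append((contig_id, frame, match.group()))
    return hits
```

Because the search is a single linear scan per reading frame, it needs no alignment database or index, which is the sense in which a pattern-based search stays computationally light. In practice the motifs and gap limits would have to be derived from curated alignments of known family members, and the candidate sequences would still need confirmation, for example by BLASTp as described above.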


The authors wish to express their gratitude to the National Science and Technology Council, Mexico, for providing financial support for this research (Project No. INFR-2016-01-269833). The authors thank César de los Santos-Briones and Mildred R. Carrillo-Pech for their technical assistance.

Author information

All the authors contributed to this work. Góngora-Castillo and Ramírez-Prado designed and performed the experiments and analyzed the data; Caamal-Pech, Contreras-De la Rosa and Apolinar-Hernández participated in performing the experiments and the data analysis. López-Ochoa and Quiroz-Moreno participated in drafting the paper and discussing the results. O’Connor-Sánchez, Ramírez-Prado and Góngora-Castillo conceived and designed the research and wrote the paper.

Corresponding authors

Correspondence to Jorge H. Ramírez-Prado or Aileen O’Connor-Sánchez.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical statement

Each of the authors confirms that this manuscript is original, has not been previously published, and is not currently under consideration by any other journal. Additionally, all of the authors have approved the contents of this paper and have agreed to 3 Biotech’s submission policies. The manuscript has two corresponding authors: Dr. Jorge H. Ramírez-Prado and Dr. Aileen O’Connor-Sánchez.

About this article

Cite this article

Góngora-Castillo, E., López-Ochoa, L.A., Apolinar-Hernández, M.M. et al. Data mining of metagenomes to find novel enzymes: a non-computationally intensive method. 3 Biotech 10, 78 (2020).


Keywords

  • Proteases
  • NGS
  • Bioinformatics pipeline
  • Pattern matching