Converting the genomic knowledge base to build protein specific machine learning prediction models; a classification study on thermophilic serine protease

  • Original Article
Biologia

Abstract

Several machine learning models have been formulated to classify proteins by thermostability, an important prerequisite for industrial use. Here we describe a classification model for a specific enzyme, serine protease. To build the classifier, 283 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted from 760 sequences, followed by feature selection. We deployed a random forest-based classifier that identified thermophilic and non-thermophilic serine proteases with an accuracy of 97.11%, higher than the other benchmark machine learning methods. Knowledge of thermostability and amino acid positional shifts can be vital for downstream protein engineering. We therefore propose a web platform to demonstrate the real-time application of this enzyme-specific classification model. The framework lets protein engineers combine their own sequence data with the classification model, align query sequences against custom databases, and identify similar novel enzymes along with their thermophilic nature.
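The abstract does not say which features were extracted from the 760 sequences. As a minimal sketch of what a sequence-derived feature vector can look like, the Python below computes plain amino acid composition (the fraction of each of the 20 standard residues). This is an assumption chosen for illustration, not necessarily the authors' feature set; the names `amino_acid_composition` and `feature_vector` and the toy sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> dict[str, float]:
    """Return the fraction of each standard amino acid in a protein sequence.

    Non-standard symbols (for example 'X' or gap characters) are ignored.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def feature_vector(sequence: str) -> list[float]:
    """Flatten the composition into a fixed-order, 20-element vector."""
    composition = amino_acid_composition(sequence)
    return [composition[aa] for aa in AMINO_ACIDS]


# Hypothetical toy sequence, not taken from the study's data set.
toy_sequence = "MKVLGAGEER"
toy_vector = feature_vector(toy_sequence)
```

A vector like this, one per sequence, is the kind of input that a feature selection step and a random forest classifier (for instance scikit-learn's `RandomForestClassifier`) would consume, with each sequence labelled thermophilic or mesophilic.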

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Abbreviations

rRNA: Ribosomal ribonucleic acid
ML: Machine learning
SVM: Support vector machines
DNN: Deep neural networks
RF: Random forest
VAL: Valine
ILE: Isoleucine
GLU: Glutamic acid
ARG: Arginine
GLY: Glycine
MET: Methionine
GLN: Glutamine
SPs: Serine proteases
API: Application programming interface
PseAA: Pseudo-amino acid composition
Se: Sensitivity
Sp: Specificity
Acc: Accuracy
TP: True positives
TN: True negatives
FP: False positives
ROC: Receiver operating characteristic
TPR: True positive rate
FPR: False positive rate
BLAST: Basic local alignment search tool
PHP: Hypertext Preprocessor
HTML: Hypertext Markup Language
CSS: Cascading Style Sheets
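
The evaluation terms in the list above (Se, Sp, Acc, TPR, FPR) are related through the confusion-matrix counts. The sketch below uses the standard textbook definitions; the paper's own formulas are not shown in this excerpt, so treat it as an assumption. The false-negative count `fn` does not appear in the abbreviation list and is introduced here because the definitions require it. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp, tn, fp, fn: true positives, true negatives, false positives, false negatives.
    """
    positives = tp + fn  # all truly positive samples
    negatives = tn + fp  # all truly negative samples
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    sensitivity = tp / positives          # Se, identical to TPR
    specificity = tn / negatives          # Sp
    accuracy = (tp + tn) / (positives + negatives)
    false_positive_rate = fp / negatives  # FPR, equal to 1 - Sp

    return {
        "Se": sensitivity,
        "Sp": specificity,
        "Acc": accuracy,
        "TPR": sensitivity,
        "FPR": false_positive_rate,
    }


# Hypothetical counts for illustration only; not results from the study.
example = classification_metrics(tp=45, tn=40, fp=10, fn=5)
```

An ROC curve plots TPR against FPR as the classifier's decision threshold is varied, so each threshold yields one (FPR, TPR) pair computed this way.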

Author information

Corresponding author

Correspondence to Lilly M. Saleena.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest or funding information to declare.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary File S1 (DOCX 13 KB)

Supplementary Fig. S1 (JPG 472 KB)

Supplementary Fig. S2 (JPG 28 KB)

Supplementary Table S1 (DOCX 20 KB)

Supplementary Video S1 (MP4 8530 KB)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sunny, J.S., Kumar, A., Nisha, K. et al. Converting the genomic knowledge base to build protein specific machine learning prediction models; a classification study on thermophilic serine protease. Biologia 77, 3615–3622 (2022). https://doi.org/10.1007/s11756-022-01214-4

  • DOI: https://doi.org/10.1007/s11756-022-01214-4
