Background

Catalases (EC 1.11.1.6) are iron porphyrin oxidoreductases that decompose hydrogen peroxide into water and oxygen [1, 2]. They are heme-containing tetrameric enzymes located in peroxisomes, subcellular organelles that are a primary site of H2O2 production under oxidative stress through photorespiratory oxidation, beta-oxidation of fatty acids, and purine catabolism [3]. Catalase (CAT) is crucial because its dysfunction is connected to pathological events such as increased vulnerability to apoptosis, tumor stimulation, dysregulated aging, and inflammation. It also aids defensive mechanisms and protects the cell from oxidative damage. Another significant property of catalase is its strong catalytic activity: using H2O2 as a substrate, it can oxidize phenols, insecticides, herbicides, polyaromatic hydrocarbons, and synthetic textile dyes [4]. Catalase was among the first enzymes to be isolated and crystallized. Catalases are found in various plant species such as tobacco, Arabidopsis thaliana, pepper, mustard, saffron, maize, castor bean, sunflower, cotton, wheat, and spinach [5,6,7,8,9,10,11]. The role of catalase in aging, senescence, and plant defense is of significant importance. In light of these applications, the present work carries out an in silico analysis of catalases from plant sources. Computational investigation of plant catalase amino acid sequences revealed conserved secondary structure elements that play a crucial role in evolution. Earlier research on catalases examined their characteristics and key biological functions. Phylogenetic analyses of the catalase gene have indicated three primary clades that separated early in the evolution of this gene family through at least two gene duplication events [12]. A phylogenetic approach could help account for the intrinsic divergence in enzyme dynamics induced by the natural evolution of sequence variation over time [13].
As genomics advances, computational tools are becoming increasingly crucial for finding and describing candidate gene families for various industrial uses. They help untangle the sequence-structure-function relationship of enzyme proteins [14]. In silico analysis of genes and proteins has gained increasing interest, with emphasis on the development of biomarkers, drug design, and highly effective microbiological agents suitable for a wide range of industries. The present work aims to understand the evolutionary relationships of catalases among plant species and to analyze their physicochemical characteristics, homology, phylogeny, secondary structure, and 3D models, including model validation, using a variety of conventional computational methods, to assist researchers in better understanding the structure of these proteins.

Methods

Protein sequence recovery

Sixty-five full-length catalase protein sequences from various plant sources were retrieved in FASTA format from the NCBI (National Center for Biotechnology Information) database for the computational analyses. The accession numbers and source organisms of the protein sequences are given in Table 1.
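As a rough illustration of how FASTA records such as those retrieved above can be loaded for downstream analysis, the following minimal Python sketch parses FASTA text into a dictionary. The function name `parse_fasta`, the labels `SEQ_A` and `SEQ_B`, and the toy sequences are hypothetical and are not taken from Table 1.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {label: sequence} dictionary."""
    records: Dict[str, str] = {}
    label = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the label.
            tokens = line[1:].split()
            label = tokens[0] if tokens else f"record_{len(records) + 1}"
            records[label] = ""
        elif label is not None:
            # Sequence line: append it to the current record.
            records[label] += line.upper()
    return records


# Toy input; labels and sequences are made up for illustration only.
toy_fasta = """>SEQ_A hypothetical catalase fragment
MDPYKYRPSS
AFNSPFWTTN
>SEQ_B hypothetical catalase fragment
MDPYKHRPSS
"""

records = parse_fasta(toy_fasta.splitlines())
for label, seq in records.items():
    print(label, len(seq))
```

In practice the same function could be applied to the lines of a downloaded FASTA file; the sequence lengths it reports are the kind of value later compared across accessions.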

Table 1 Selected protein sequences of catalases from different plant sources

ProtParam tool for primary sequence analysis

The ExPASy ProtParam tool (http://web.expasy.org/protparam/) was used to compute the physicochemical parameters of the selected catalases. ProtParam calculates properties that can be derived from a protein sequence: molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, and grand average of hydropathicity (GRAVY) [15].

Multiple Sequence Alignment (MSA)

The protein sequences were aligned with the ClustalW program as implemented in MEGA 6.1 software, and the resulting multiple sequence alignment was checked for accuracy.

Amino acid composition

The amino acid composition of the catalase sequences was computed with MEGA 11 (https://www.megasoftware.net/), and the frequency of each amino acid was retrieved for every species.
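The composition calculation can be sketched in a few lines of Python. This is not MEGA's implementation; it is a minimal sketch that assumes composition is the percentage of each of the 20 standard residues per sequence and that the reported average is an unweighted mean across sequences. The function names and toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, List

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence: str) -> Dict[str, float]:
    """Percentage of each standard amino acid in one sequence."""
    counts = Counter(res for res in sequence.upper() if res in STANDARD_AA)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {aa: 100.0 * counts[aa] / total for aa in STANDARD_AA}


def mean_composition(sequences: List[str]) -> Dict[str, float]:
    """Unweighted mean of the per-sequence percentages (an assumption)."""
    per_seq = [aa_composition(seq) for seq in sequences]
    return {aa: sum(c[aa] for c in per_seq) / len(per_seq) for aa in STANDARD_AA}


# Toy sequences, made up for illustration only.
toy_sequences = ["MPPDAK", "PPGDDA"]
mean_pct = mean_composition(toy_sequences)
most_abundant = max(mean_pct, key=mean_pct.get)
print(most_abundant, round(mean_pct[most_abundant], 2))
```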

Phylogenetic tree construction

To better understand the evolutionary relationships between plant species, phylogenetic trees of the catalase sequences were constructed and visualized with MEGA6 software using the neighbor-joining (NJ) method or UPGMA [16].
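To make the distance-based clustering idea concrete, the sketch below builds a UPGMA topology from an aligned set of toy sequences. It is only an illustration under stated assumptions: the p-distance (proportion of differing sites, gaps skipped) is assumed as the distance measure because the distance model used in MEGA is not stated here, branch lengths are omitted, and the names `p_distance`, `upgma`, and the sequences A to D are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, matrix):
    """Return a nested-tuple topology; matrix[i][j] is the distance i-j."""
    # Active clusters: id -> (subtree, number_of_leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            # Size-weighted average equals the mean over all leaf pairs.
            dist[frozenset((next_id, k))] = (
                n_i * dist[frozenset((i, k))] + n_j * dist[frozenset((j, k))]
            ) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Toy aligned sequences, made up for illustration only.
aligned = {
    "A": "MDPYKYRPSS",
    "B": "MDPYKHRPSS",
    "C": "MDAYKHRPTS",
    "D": "MEAFKHKPTS",
}
names = list(aligned)
matrix = [[p_distance(aligned[a], aligned[b]) for b in names] for a in names]
tree = upgma(names, matrix)
print(tree)
```

UPGMA assumes a constant rate of change along all lineages, whereas neighbor-joining does not, so the two methods can give different topologies for the same distance matrix.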

Motifs search and domain discovery

Motifs were identified using the MEME tool (http://meme.sdsc.edu/meme/meme.html), and the corresponding protein families were searched in the NCBI conserved domain database (CDD) (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). The biological functions of the conserved motifs found by MEME were analyzed using BLAST, and domains were assessed using InterProScan, which reports the best possible match based on the highest similarity score [17].

Prediction of secondary structure

Secondary structure has a direct impact on how proteins fold and deform, since it describes how the amino acid sequences of plant catalases form helices, sheets, and turns in the molecule. SOPMA (self-optimized prediction method with alignment) was used to predict the secondary structure of the plant catalases [18]. It is a self-optimized prediction method based on the homologue method of Levin and colleagues [19].

Comparative 3D modeling

A query protein sequence was selected from each cluster of the plant catalase phylogenetic tree, and comparative homology modeling was performed using SWISS-MODEL (http://swissmodel.expasy.org) [20], a server for automated comparative 3D modeling of protein structures.

Model evaluation

The most crucial step in homology modeling is model evaluation, which demonstrates that the modeled protein is of acceptable quality. Here, the predicted CAT model was evaluated and verified using the ERRAT [21], Verify3D [22], and PROCHECK [23] programs available from the SAVES server (http://nihserver.mbi.ucla.edu/SAVES). The stereochemical quality of the predicted model was assessed with a Ramachandran plot.

Protein-protein interaction

The STRING v10.0 server (http://string-db.org/) was used to identify proteins that interact with Arabidopsis thaliana catalase. The query sequence was the Arabidopsis thaliana catalase with accession number CAA45564.1, and a functional protein association network was created [24].

Results

Retrieval of sequences

The protein sequences of many enzymes, such as peroxidases [25,26,27], pectinases, proteases [28], lipases [29], phytases, polyphenol oxidases [15], and cellulases [29], have been assessed and analyzed using bioinformatics tools. The current study used various bioinformatic tools to analyze the protein sequences of catalases, industrially important enzymes, from various plant sources. Around 150 catalase protein sequences from various plant sources were initially retrieved from NCBI using BLAST. Of these, sequences with more than 70% similarity were selected, and only the 65 full-length protein sequences were evaluated computationally (see Table 1). Among the plant sources, Oryza sativa was the most represented, with 11 accessions forming the main group. Oryza sativa has four catalase genes, OsCATA, OsCATB, OsCATC, and OsCATD [30], which vary functionally under different abiotic stress conditions. Multiple accessions from the same catalase source help us gain insight into the structural and functional diversity of enzymatic proteins.

Physicochemical characterization

ProtParam was used to elucidate several physicochemical properties of the sequences. The length of the 65 catalase protein sequences studied ranged from 90 to 533 amino acid residues. The molecular weights varied between 10,322.46 and 61,366.87 daltons, while the pI values varied between 4.53 and 7.95. Most catalases had a pI between 5 and 7, while AAF34718 of Capsicum annuum had a pI of 7.11, and the Oryza sequences placed in group F of the phylogenetic tree showed pI values between 4 and 5. Other physicochemical characteristics such as the instability index, aliphatic index, and hydropathicity (GRAVY) also varied among these CAT proteins. The aliphatic index measures the relative volume occupied by the aliphatic side chains of alanine, valine, leucine, and isoleucine, and it may be regarded as a positive factor for the thermostability of globular proteins. The following formula is used to determine the aliphatic index [31].

$$\mathrm{Aliphatic}\ \mathrm{index}=\mathrm{X}\ \left(\mathrm{Ala}\right)+\mathrm{a}\times \mathrm{X}\ \left(\mathrm{Val}\right)+\mathrm{b}\times \left(\mathrm{X}\ \left(\mathrm{Ile}\right)+\mathrm{X}\ \left(\mathrm{Leu}\right)\right)$$

Here X denotes the mole percent of the given residue, and the coefficients a and b are the relative volumes of the valine side chain (a = 2.9) and of the Leu/Ile side chains (b = 3.9) compared with the side chain of alanine.
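The formula above can be checked with a small worked example. The Python sketch below assumes X is the mole percent of each residue (100 times the count divided by the sequence length); the function name `aliphatic_index` and the toy peptide are hypothetical and serve only to illustrate the arithmetic.

```python
def aliphatic_index(sequence: str, a: float = 2.9, b: float = 3.9) -> float:
    """Aliphatic index = X(Ala) + a * X(Val) + b * (X(Ile) + X(Leu)).

    X is taken as mole percent: 100 * residue count / sequence length.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(residue: str) -> float:
        return 100.0 * seq.count(residue) / len(seq)

    return (mole_percent("A")
            + a * mole_percent("V")
            + b * (mole_percent("I") + mole_percent("L")))


# Toy 10-residue peptide (made up): one each of A, V, I, L plus six G.
# Each aliphatic residue is 10 mol%, so the index is
# 10 + 2.9 * 10 + 3.9 * (10 + 10) = 117.
toy_peptide = "AVILGGGGGG"
toy_index = aliphatic_index(toy_peptide)
print(round(toy_index, 2))
```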

Based on the data shown in Table 2, plant catalases are assumed to be thermostable. The instability index estimates the stability of a protein molecule and is related to its in vivo half-life: a value greater than 40 suggests a half-life of less than 5 h, while a value less than 40 indicates a half-life of more than 16 h [32, 33]. Most plant catalases have an instability index of less than 40, except a few from Oryza species, Capsicum annuum, and Brassica juncea. The hydrophobicity of a peptide is represented by the grand average of hydropathicity (GRAVY), calculated as the sum of the hydropathy values of all amino acids divided by the sequence length; the negative GRAVY values obtained indicate that the plant catalases are hydrophilic.
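The GRAVY definition given above (sum of hydropathy values divided by sequence length) can be sketched as follows. The sketch assumes the Kyte-Doolittle hydropathy scale, which is the scale conventionally used for GRAVY; the function name `gravy` and the toy peptides are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[res] for res in seq) / len(seq)


# Toy peptides (made up): a charged one and an aliphatic one.
hydrophilic_score = gravy("DEKR")   # negative value, hydrophilic
hydrophobic_score = gravy("AVIL")   # positive value, hydrophobic
print(round(hydrophilic_score, 3), round(hydrophobic_score, 3))
```

A negative score, as obtained for the charged toy peptide, corresponds to the hydrophilic interpretation used for the plant catalases.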

Table 2 Physicochemical characterization of protein sequences of plant catalases as revealed by ProtParam

Assessment of phylogenetic tree and MSA

The phylogenetic tree revealed six distinct clusters, labeled A, B, C, D, E, and F, containing 4, 22, 12, 5, 7, and 15 protein sequences, respectively (Fig. 1). Multiple accessions belonging to the same genus grouped together, suggesting similarity at the sequence level, except for the Oryza sativa protein sequences, which were distributed between groups D and F. Phylogenetic analysis provides an in-depth understanding of how species evolve through genetic alterations. Phylogenetics can be used to trace the path that connects a modern plant CAT to its ancestral origin and to anticipate future genetic divergence. It can also be helpful in comparative genomics, which analyzes the relationship between genomes of different species by gene prediction or discovery and by locating specific genetic regions along a genome [34,35,36]. The multiple sequence alignment performed before building the phylogenetic tree is shown in Fig. 2 and reveals the degree of homology between the sequences from different plant sources. This information could be used to design a specific catalase probe or primer that would serve as a marker for identifying putative catalase genes in sequenced plant strains. Advances in the comparative genomic study of proteins provide a detailed understanding of functional genes within and between plant species, offering clear evidence for evolutionary research and for hypotheses on the gene function of plant catalases [37].

Fig. 1 Phylogenetic tree of protein sequences of plant catalases constructed using the NJ method. The unique clusters A, B, C, D, E, and F are highlighted, consisting of 4, 22, 12, 5, 7, and 15 members, respectively

Fig. 2 Multiple sequence alignment of distinct clusters A, B, C, D, E, and F of plant catalases

Motifs and domain identification

The structural and functional complexity of enzymes can be predicted and assessed using attributes such as sequence order features, domains, and motifs. Sequence motifs identified by protein sequence analysis can be used as signature sequences of the targeted enzymes to determine their putative functions [38,39,40]. Five sequence motifs were identified among the 65 plant catalases; they were uniformly distributed, each 50 residues wide, and their best possible matching amino acid sequences are shown in Table 3. When these motifs were subjected to BLAST, they matched the plant catalase superfamily PLN02609.

Table 3 The five motifs with their best possible matching amino acid sequences and respective domains

Amino acid composition

MEGA 11 was used to compute the amino acid composition of each sequence individually. The average composition was highest for proline (7.38%), followed by aspartate (7.12%) (Table 4), suggesting significant conformational rigidity in the secondary structure of the protein due to the distinctive cyclic structure of the proline side chain [41].

Table 4 Amino acid composition (%) of CAT protein from different plant sources

Prediction of secondary structure

Predicting the secondary structure of proteins is critical to understanding protein folding in three dimensions. The secondary structure is predicted from the primary protein sequence [42]. SOPMA predicted a predominance of random coils, accounting for more than 40% of residues, in most sequences; the exceptions were a few sequences from Capsicum annuum, Solanum melongena, Solanum lycopersicum, Oryza meridionalis, Oryza rufipogon, Oryza glaberrima, and Oryza barthii, in which extended strands predominated. The alpha-helix and beta-turn contents were highest in Populus deltoides and Oryza sativa, as given in Table 5.
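The percentages summarized above can be derived from a per-residue state string. The Python sketch below assumes a four-state encoding with the letters h (alpha helix), e (extended strand), t (beta turn), and c (random coil); this encoding, the function name `state_percentages`, and the toy prediction string are assumptions for illustration and are not taken from Table 5.

```python
from collections import Counter
from typing import Dict

# Assumed four-state encoding of a per-residue prediction.
STATE_NAMES = {
    "h": "alpha helix",
    "e": "extended strand",
    "t": "beta turn",
    "c": "random coil",
}


def state_percentages(states: str) -> Dict[str, float]:
    """Percentage of residues assigned to each secondary structure state."""
    states = states.lower()
    if not states:
        raise ValueError("empty prediction string")
    counts = Counter(states)
    unknown = set(counts) - set(STATE_NAMES)
    if unknown:
        raise ValueError(f"unexpected state codes: {sorted(unknown)}")
    return {name: 100.0 * counts[code] / len(states)
            for code, name in STATE_NAMES.items()}


# Toy 20-residue prediction, made up for illustration only.
toy_prediction = "cccchhhheettcccccchh"
percentages = state_percentages(toy_prediction)
dominant_state = max(percentages, key=percentages.get)
print(dominant_state, percentages[dominant_state])
```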

Table 5 Secondary structure prediction of plant catalases using SOPMA

Comparative homology modeling and its functional analysis

To predict a 3D structure by homology modeling, a template of known structure with a sequence similar to the query is required. One organism from each cluster was selected, as shown in Table 6, and homology modeling of the 3D protein structure was carried out; the Arabidopsis thaliana query sequence had the highest sequence identity and GMQE score. The 3D structure was built by SWISS-MODEL using template 4qol.1.A, a Bacillus pumilus catalase, by extrapolating experimental data from this evolutionarily related protein structure (Fig. 3), and the quality estimation of the predicted model is shown in Fig. 4a. The sequence identity between template and query was 53.8%, the QMEAN score was −1.44, the GMQE value was 0.81, and the predicted oligomeric state was a homotetramer, with a template resolution of 1.65 Å [43]. As part of the evaluation and validation process, the predicted protein model of the query sequence (in .pdb format) was uploaded to several servers. The Ramachandran plot analysis showed that 89.8% of residues lay in the most favored (red) regions, 10.1% in the additional allowed (brown) regions, and 0.4% in the generously allowed regions, supporting the quality of the modeled structure (Fig. 5).

Table 6 Characterization of selected organism modeling from each cluster evaluated by SWISS-MODEL

Fig. 3 Predicted protein model of catalase enzyme of Arabidopsis thaliana showing four distinct chains of the homotetramer

Fig. 4 Predicted protein model quality estimation by SWISS-MODEL

Fig. 5 Ramachandran plot of predicted CAT model from Arabidopsis thaliana generated from PROCHECK. Residues in most favored regions (A, B, L): 89.8%. Residues in additional allowed regions (a, b, l, p): 10.1%. Residues in generously allowed regions (~a, ~b, ~l, ~p): 0.4%. Residues in disallowed regions: 0.4%

The overall G factor for dihedral angles and covalent forces was −0.16, which is above the acceptable threshold of −0.5. A high G factor indicates that a stereochemical property corresponds to a high-probability conformation [44, 45]. The predicted model was submitted to the SAVES server. ERRAT plots were used to examine the distribution of atoms relative to one another in the protein model and to judge the model's reliability from the amino acid environment. The overall ERRAT quality factor was 92.5, indicating that only a small proportion of individual residues exceeded the error threshold (Fig. 6). Verify3D suggested that at least 80% of the amino acids in the CAT model scored ≥ 0.2 in the 3D/1D profile, while the average residue value was around 70.2%, suggesting the compatibility of the predicted model with its amino acid residues [46]. The QMEAN Z-score in Fig. 4b and c was −1.4, which lies within the expected range of 0.0 to −2.0 and represents a well-defined structure [47]. The cellular machinery is built on a foundation of proteins and their functional relationships, so it is necessary to consider the network of interactions between proteins to understand biological phenomena. The STRING analysis revealed ten predicted interacting partners of the query CAT protein from Arabidopsis thaliana (accession number CAA45564.1), a peroxisomal catalase, and identified glutathione reductase as the closest interacting protein, with the shortest distance. In contrast, ACX5 (putative peroxisomal acyl-coenzyme A oxidase) remained distant from the query protein (Figs. 7 and 8) [48].

Fig. 6 ERRAT plot of Arabidopsis thaliana catalase model with overall quality factor 92.47

Fig. 7 Map of the protein-protein interaction of Arabidopsis thaliana catalase protein

Fig. 8 Predicted interacting protein partners of the query sequence from STRING server

Discussion

Computational approaches have established themselves as a valuable complement to our understanding of the protein universe and its properties. In silico analysis is one of the most helpful tools in computational biology for exploring the structural and functional properties of proteins. Hence, this study explored the structural and functional properties of catalase enzymes from plants using bioinformatics tools such as ProtParam, MEGA-X, SOPMA, SWISS-MODEL, and the SAVES server. The ExPASy tool revealed several physicochemical characteristics of the retrieved catalase sequences, each reflecting a distinct aspect of protein behavior. The pH at which a protein carries no net electrical charge and is considered neutral is known as its isoelectric or isoionic point [49]. Prediction of the pI is critical for developing buffer systems for purification and isoelectric focusing. The study suggested that the theoretical pI of most plant catalases is acidic, ranging from 5 to 7, whereas Capsicum annuum has a slightly alkaline pI of 7.11. The instability index of the catalases ranged from 28.94 to 44.90; it was below 40 for most sequences and exceeded 40 only for accessions CAD42908, CAD42909 (Prunus persica), AAD17934, AAD17935, AAD17938 (Brassica juncea), KFK30147 (Arabis alpina), CAA85424 (Nicotiana plumbaginifolia), BAF91369, AAF34718 (Capsicum annuum), BAA81682, BAA81681 (Oryza glaberrima), and BAA81680 (Oryza barthii). The aliphatic index refers to the relative volume of a protein occupied by its hydrophobic aliphatic side chains. The heat stability of a protein depends on its aliphatic index: a higher aliphatic index means that the protein is better able to withstand high temperatures [50]. Catalases with an aliphatic index ranging from 65.66 to 75.55 have substantial amounts of hydrophobic amino acids and are very thermally stable.
The GRAVY scores indicated the hydrophilic nature of the plant catalases. A negative GRAVY score indicates that the protein could be globular (hydrophilic) rather than membranous (hydrophobic). This information could aid in the identification of these proteins [51]. The phylogenetic tree was constructed using the maximum likelihood method to show evolutionary relationships among plant catalases. The distribution of Oryza sativa in different clusters C, D, and F revealed its genetic diversity and its similarity with Festuca arundinacea and Saccharum spontaneum. Using a Pfam database search and NCBI/CDD-BLAST, the proteins were categorized into specific families based on the presence of specific domains in their sequences. NCBI BLAST assigned the catalase proteins with conserved domains to the PLN02609 superfamily. A superfamily is a collection of conserved domain models that generate overlapping annotations on the same protein sequences and represent evolutionarily related domains. Protein secondary structure prediction from sequences is regarded as a link between the prediction of primary and tertiary structures [52]. Secondary structure prediction revealed the predominance of random coils followed by alpha helices in most of the catalases [3], which is highly similar to the results reported for the CAT1 genes of PgCAT1, Soldanella alpina, and Gossypium hirsutum [7]. Random coils are irregular secondary arrangements found in the N- and C-terminal arms and loops of the protein structure; they occur because of electrostatic repulsion and steric hindrance between bulky adjacent residues such as isoleucine or charged residues such as glutamic acid or aspartic acid. In a random coil state, the average conformation of each amino acid residue is independent of the conformations of all residues other than those immediately proximal in the primary structure [53].
The amino acid composition of plant catalases revealed the highest content for proline, which could explain the predominantly coiled structure. Proline has the unique ability to promote coiling by introducing kinks in polypeptide chains that disrupt regular secondary conformations [54]. In silico prediction of a protein's 3D model is challenging and complements data obtained from NMR or crystallography-based approaches [48]. The query sequence (CAA45564) was searched against the PDB using BLAST to find the best template. Template 4qol.1.A, a Bacillus pumilus catalase, was selected on the basis of the highest sequence identity (53.8%) together with its QMEAN and GMQE scores. The predicted structure was validated with computational tools, and the 89.8% of residues in the most favored regions of the Ramachandran plot implied a good-quality model. The ERRAT and Verify3D tools of the SAVES server and the QMEAN Z-score suggested a well-defined protein structure. The functional association analysis of the query sequence identified glutathione reductase as the closest interacting protein, with the shortest distance, which may reflect overlapping functional roles in the metabolic pathway [55].

Conclusion

In silico analysis of plant catalase proteins provides insight into their numerous catalytic sites, allowing for possible manipulation of desirable qualities relevant to various sectors. Phylogenetic analysis revealed the similarity of various plant catalases and helped elucidate how species evolve genetically. Phylogenetics can be used to determine the genetic link between a modern organism and its ancestral origin and to anticipate future genetic divergence. The numerous amino acid residues conserved among distinct clusters may allow the development of specific probes or markers that identify source species from a particular taxon. Secondary structure analysis confirmed the predominance of random coils, followed by alpha helices, extended strands, and beta turns. Plant catalases had the highest proline content in their amino acid composition, which could explain their coiled structural content, because proline promotes coiling in polypeptide chains by disrupting regular secondary conformations. The predicted 3D CAT model from Arabidopsis thaliana was a homotetrameric, thermostable protein with a molecular weight of 59 kDa, and its structure was validated by PROCHECK (Ramachandran plot), ERRAT, and Verify3D. In silico protein structure analysis is an extremely valuable technique for exploring protein structure-function relationships when crystal structures are unavailable. It can also support the prediction of ligand-receptor and enzyme-substrate interactions, the design of mutagenesis experiments, the interpretation of SAR data, and loop structure prediction. While these studies build a robust foundation for wet-lab experimentation, they also provide a strong framework for exploring novel sources through metagenomics approaches and directed evolution to incorporate desired functional qualities.