Introduction

Oxygen is an essential component of the atmosphere and is necessary to sustain most terrestrial life, as it is used in respiration and in the regulation of a variety of cellular functions. The oxygen-binding proteins (oxy-proteins) of various organisms differ considerably from one another and are classified, mainly on the basis of their structure and physicochemical properties, as hemoglobin, hemocyanin, hemerythrin, myoglobin, leghemoglobin, and erythrocruorin. Each oxy-protein has its own functional characteristics and structure, with a unique oxygen-binding capacity [1,2,3,4,5,6,7,8,9,10,11].

A number of computational methods have been proposed for identifying functional proteins from their primary sequences using machine learning approaches [12,13,14]. Such methods continually need improvement, or new features, to identify protein families and their classes and sub-classes while avoiding incorrect predictions and reducing false positive rates.

In 2007, Muthukrishnan et al. developed the Oxypred method for predicting oxygen-binding proteins using simple amino acid composition (AC) and dipeptide composition (DC). The growth of protein sequence databases and the availability of newly annotated oxy-protein sequences in the post-genomic era encouraged us to introduce an improved version of the Oxypred method. We included a recently generated, highly non-redundant dataset in the development of Oxypred2, together with different protein features [15]. Recently, it has been observed that using an evolutionary profile in the form of a position-specific scoring matrix (PSSM) predicts various functional proteins with higher accuracy [16, 17]. Hence, we applied several approaches, including the PSSM-based evolutionary profile, to improve the prediction quality for oxy-proteins.

In this study, two recently generated non-redundant datasets, with sequence-similarity cutoffs of 50 and 90%, were used to develop Oxypred2. Compared with the previous study, the PSSM and Hybrid approaches, confusion matrix analysis, prediction score graphs, and ROC analysis have been added.

Different prediction features are important for understanding the functional behavior of proteins [18,19,20,21]. Here, we compared the prediction performance of the 50 and 90% similarity datasets in all modules to find the best approach for identifying oxy-proteins. The prediction results and their analysis show that Oxypred2 is an improved version of Oxypred and an alternative method for identifying oxy-proteins.

Main text

Methods

Datasets

Two sets of sequences (90 and 50% similarity cutoffs) were extracted from the UniProt database by searching with the individual keyword for each oxy-protein class [22]. The final datasets contain 2498 and 5474 sequences at the 50 and 90% cutoffs, respectively. By sub-class, the 50 and 90% datasets contain, respectively, 47 and 114 erythrocruorin, 42 and 154 hemocyanin, 1378 and 2585 hemerythrin, 957 and 2462 hemoglobin, 34 and 34 leghemoglobin, and 40 and 125 myoglobin sequences. Because few leghemoglobin sequences were available, the 90% leghemoglobin dataset was also used for the 50% dataset. Independent non-oxy-protein datasets of comparable size were constructed by random selection, giving 2565 and 5499 sequences for the 50 and 90% cutoff datasets, respectively.

Support Vector Machines

In this study, the freely downloadable SVM-light package was used to generate the modules [23, 24]. SVM has been successfully applied to numerous classification and pattern recognition problems, such as the prediction of protein secondary structure, subcellular localization, DNA-binding, ATP-binding, and transporter family proteins [25,26,27,28,29,30,31,32,33].
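
SVM-light reads its training data as plain text, one example per line: a target label followed by index:value pairs in increasing index order, with indices starting at 1. As a minimal sketch, assuming a numeric feature vector has already been computed for each sequence, one example could be formatted as follows. The helper name to_svmlight_line is hypothetical and is not part of SVM-light or of the original pipeline.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVM-light input line.

    label: +1 for an oxy-protein, -1 for a non-oxy-protein.
    features: sequence of numbers; feature indices start at 1.
    Zero-valued features are omitted because the format is sparse.
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    pairs = [
        f"{index}:{value:.6g}"
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    target = "+1" if label == 1 else "-1"
    return " ".join([target] + pairs)


if __name__ == "__main__":
    # Toy three-feature vector, not real composition data.
    print(to_svmlight_line(1, [0.5, 0.0, 0.25]))
```

Writing one such line per sequence gives a file that the SVM-light training program can read directly.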

PSSM-profile

The PSSM profile provides evolutionary information about residue conservation at a given position in a protein sequence. The PSSM profiles were generated using the GPSR package available at http://www.imtech.res.in/raghava/gpsr/. We applied GPSR programs to run PSI-BLAST searches against the non-redundant (nr) database using different numbers of iterations with an E-value cutoff of 0.001 [34, 35]. Each value was then normalized to the range between 0 and 1 using the following equation,

$$Normalized\,value = \frac{{\left( {Value - Minimum} \right)}}{{\left( {Maximum - Minimum} \right)}}$$
(1)

After normalization, the minimum score becomes “0” and the maximum score becomes “1”.
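
Equation (1) can be illustrated with a short sketch. It assumes that the minimum and maximum are taken over the whole matrix of one protein; the text does not state whether scaling was done per matrix, per row, or per column, so this choice and the function name normalize_pssm are assumptions.

```python
def normalize_pssm(pssm):
    """Min-max scale every score of a PSSM (a list of rows) to [0, 1].

    Assumption: minimum and maximum are taken over the whole matrix.
    """
    values = [value for row in pssm for value in row]
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant matrix has no spread; map every score to 0.
        return [[0.0 for _ in row] for row in pssm]
    span = highest - lowest
    return [[(value - lowest) / span for value in row] for row in pssm]


if __name__ == "__main__":
    toy = [[-4, 0], [4, 2]]  # toy scores, not a real PSI-BLAST profile
    print(normalize_pssm(toy))
```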

Evaluation models

We applied the fivefold cross-validation technique, as used by many investigators, with SVM as the prediction engine. In this technique, the dataset is divided into five sets containing nearly equal numbers of sequences; four sets are used for training and the remaining set for testing. Training and testing were carried out five times so that each set was used once for testing, and the whole process was repeated 20 times.
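
The partitioning scheme described above can be sketched as follows. This is an illustration only: the reshuffling before each repetition, the fixed random seed, and the function name five_fold_splits are assumptions, since the text does not say how the sequences were assigned to sets.

```python
import random


def five_fold_splits(n_items, n_folds=5, n_repeats=20, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated k-fold CV.

    Assumption: the data are reshuffled before each repetition.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        # Striding gives folds whose sizes differ by at most one.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


if __name__ == "__main__":
    for train, test in list(five_fold_splits(23))[:5]:
        print(len(train), len(test))
```

Each of the 5 x 20 = 100 yielded pairs would correspond to one training run and one test run of the classifier.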

The objective of our classifiers is to discriminate oxy-proteins from the negative (non-oxy) class, and the following terminology is used to evaluate our classifiers:

  • True positive (TP): a protein is identified as an oxy-protein by both the classifier and the oxy-protein model.

  • True negative (TN): a protein is not identified as an oxy-protein by either the classifier or the oxy-protein model.

  • False positive (FP): a protein is identified as an oxy-protein by the classifier but not by the oxy-protein model.

  • False negative (FN): a protein is identified as an oxy-protein by the oxy-protein model but not by the classifier.

In order to assess the prediction performance, accuracy (ACC), Matthews correlation coefficient (MCC), sensitivity (Sen) and specificity (Sep) were calculated using the standard Eqs. (2–5) [36,37,38],

$$Accuracy\,\left( {ACC} \right) = \frac{TP + TN}{TP + TN + FP + FN}$$
(2)
$$Sensitivity\,\left( {Sen} \right) = \frac{TP}{TP + FN}$$
(3)
$$Specificity\,\left( {Sep} \right) = \frac{TN}{TN + FP}$$
(4)
$$MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {\left( {TP + FP} \right)\left( {TP + FN} \right)\left( {TN + FP} \right)\left( {TN + FN} \right)} }}$$
(5)
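
Equations (2) to (5) can be computed directly from the four confusion counts. The sketch below is illustrative; the function name evaluate and the convention of returning 0 when a denominator is zero are assumptions, not part of the original method.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return ACC, Sen, Sep and MCC from confusion counts (Eqs. 2-5).

    Assumption: a measure whose denominator is zero is reported as 0.0.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    sep = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"ACC": acc, "Sen": sen, "Sep": sep, "MCC": mcc}


if __name__ == "__main__":
    # Toy counts, not results from this study.
    print(evaluate(tp=90, tn=80, fp=20, fn=10))
```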

Results

Determining the relative amino acid composition gives a characteristic profile for a protein [39]. Here, we calculated the AC of oxy-proteins based on median scores. The residues Ala and Phe are more than 0.5% higher in oxy-50 sequences than in non-oxy-50 sequences. In oxy-90 sequences, Ala, Phe, His and Lys are 0.5% higher than in non-oxy-90 sequences. In the oxy-50 classification dataset, Ala, Lys and Val are more than 2, 3 and 2% higher in Leg, Hemo and Myo, respectively, whereas Ala and Arg are about 3% lower in Hcy-50 and Leg-50 sequences, respectively. In the oxy-90 datasets, Ala is 2% higher in Ery-90 and Leg-90, and Glu, Lys and Val are 3% higher in Heme, Myo and Leg proteins, respectively, whereas Ala, Glu and Arg are 2% lower in Hcy, Ery and Leg proteins (Fig. 1, Additional file 1: Figure S1 and Additional file 2: Figure S2). We also compared the sequence length profiles of the oxy-50 and oxy-90 sub-classes and found that most Heme and Hemo sequences are between 101 and 200 residues long, while the other proteins are distributed across different length ranges (Additional file 3: Figure S3).

Fig. 1

Amino acid distribution difference between oxy and non-oxy sequences: It has been calculated based on median scores. a Difference between oxy-50 and non-50. b Difference between oxy-90 and non-90. c Differences within the oxy-sub-classes of oxy-50 datasets. d Differences within the oxy-sub-classes of oxy-90 datasets
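
For reference, the AC and DC features compared here can be sketched under their conventional definitions: the percentage of each of the 20 standard residues (20 values) and the percentage of each ordered residue pair (400 values). The residue ordering, the skipping of non-standard residues, and the function names are assumptions; the original implementation may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
DIPEPTIDE_INDEX = {
    a + b: k for k, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))
}


def amino_acid_composition(sequence):
    """Return 20 percentages, one per standard residue."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    return [100.0 * residues.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return 400 percentages, one per ordered pair of adjacent residues."""
    seq = sequence.upper()
    counts = [0] * 400
    n = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in DIPEPTIDE_INDEX:  # skip pairs with non-standard residues
            counts[DIPEPTIDE_INDEX[pair]] += 1
            n += 1
    return [100.0 * c / n if n else 0.0 for c in counts]


if __name__ == "__main__":
    toy = "AAKK"  # toy sequence, not a real oxy-protein
    print(amino_acid_composition(toy)[0], dipeptide_composition(toy)[0])
```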

With the AC approach, the maximum accuracy was 82.05 and 87.79% for the oxy-50 and oxy-90 datasets, respectively. With the DC approach, the maximum accuracy was 80.42 and 84.81% for oxy-50 and oxy-90, respectively. The complete prediction results are shown in Additional file 4: Table S1, and the classification results are shown in Table 1. Evolutionary profile-based PSSM methods have been applied to the prediction of many functional proteins [40, 41]. With the PSSM approach, the maximum accuracy was 85.10 and 81.81% for the oxy-50 and oxy-90 datasets, respectively. In classification, the PSSM prediction accuracy for Ery, Hcy, Heme, Leg, and Myo was slightly higher in oxy-50 than in oxy-90.

Table 1 Performance of the oxy-protein sub-class SVM models (Ery, Hcy, Heme, Hemo, Leg and Myo) with the different approaches, and comparison between oxy-50 and oxy-90 output data

Further, to improve the prediction accuracy, Hybrid approach-based modules were developed [42]. The prediction accuracy was 81.73 and 83.51% for oxy-50 and oxy-90, respectively. In classification, the accuracy for Hcy, Heme and Hemo was slightly higher in oxy-90 than in oxy-50. Overall, the DC and Hybrid prediction results are similar for oxy-50 and oxy-90 and show no significant differences (Table 1).

To verify the prediction performance of the developed models, we also performed ROC analysis on our original data and achieved an area under the curve (AUC) of 0.894 and 0.959 for oxy-50 and oxy-90, respectively (Additional file 5: Figure S4); the classification AUCs are shown in Additional file 4: Table S2 and Fig. 2. In addition, confusion matrix-based prediction score graphs were generated [43] to cross-check the developed models' performance on the original data. According to our results, no misclassification occurred in the proposed models; that is, no positive sequence was identified as negative and no negative sequence was identified as positive. Our developed models are therefore good at recognizing positive and negative sequences.

Fig. 2

ROC curves for oxy-protein classification in all approaches. The performance of the Oxypred2 models is shown by ROC plots for all oxy sub-classes. The area under the curve was measured for the models of all approaches
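
As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive sequence receives a higher score than a randomly chosen negative one, with ties counted as one half. This pairwise form is equivalent to the area under the ROC curve but is not necessarily how the values reported here were obtained; the function name roc_auc is an assumption.

```python
def roc_auc(scores, labels):
    """AUC from classifier scores and labels (+1 positive, anything else negative).

    Uses the pairwise (Mann-Whitney) form; quadratic in the number of
    sequences, which is adequate for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy scores, not outputs of the Oxypred2 models.
    print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, -1, -1]))
```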

At the same time, the classification-based models also perform well in recognizing positive and negative sequences. Even so, some sequences could not be identified by the model of their own class and were instead identified by the models of other classes. In the oxy-50 datasets, 3 Ery, 10 Hemo and 5 Myo sequences were not recognized by their own models in any approach but were recognized by other sub-class models. In the oxy-90 datasets, 2, 4 and 2 sequences of Ery, Hemo and Myo, respectively, were not recognized by their own models but were identified by other models. Interestingly, some Ery, Hemo and Myo sequences were identified neither by their own models nor by other models. The complete confusion matrix results for both oxy-50 and oxy-90 are shown in Additional file 4: Table S3. The prediction score graphs were developed mainly to show how well the models separate positive and negative sequences. According to the graphs, the DC, PSSM and Hybrid approaches show separation with the largest margins. However, the confusion matrix results show that some sequences are very similar among Ery, Hemo and Myo, and these sequences may be evolutionarily important (Additional file 6: Figure S5 and Additional file 7: Figure S6).

We also compared the profiles of accuracy, sensitivity, and specificity across threshold levels. Most classes perform better in the 0–1.5 threshold range. In most cases the ACC, Sen and Sep curves meet at a particular threshold, but in a few cases they do not meet at any threshold. With the AC approach, the Ery-50 and Ery-90 curves do not meet, whereas with the DC and PSSM approaches the curves for both Ery-50 and Ery-90 meet at negative thresholds. Interestingly, with the Hybrid approach, the Ery-50 curves meet at a negative threshold but the Ery-90 curves meet at a positive threshold. In the Hcy class, the AC-90 curves meet at a negative threshold, and those of all other approaches meet at positive thresholds (0–1.5). All Heme and Hemo class curves meet at positive thresholds in all approaches. In the Leg class, only the AC-50 curves meet at a negative threshold; those of all other approaches meet at positive thresholds. In the Myo class, the AC-50 curves do not cross, the DC-90 and PSSM-50 curves cross at the “0” threshold, the Hybrid-90 curves cross at a positive threshold, and those of all other approaches cross at negative thresholds. Moreover, in most cases the accuracy and specificity values are similar (Additional file 8: Figure S7).

In the Oxypred2 study, we averaged ACC, Sen, and Sep over thresholds from −1.5 to +1.5 and compared the performance of the oxy-50 and oxy-90 sub-classes in all approaches. The sensitivity for Ery and Myo was higher in oxy-90 than in oxy-50. Moreover, all sub-classes show more than 80% ACC, Sen and Sep in oxy-90, except that the specificity for Heme and Hemo is below 80% with PSSM and Hybrid, although it is still slightly better than the oxy-50 average. In all approaches, the Ery class sensitivity improved in oxy-90 compared with oxy-50 (Additional file 9: Figure S8). With the PSSM method, prediction accuracy was higher than with the AC and DC methods.
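
The threshold analysis can be sketched as follows: each SVM score is compared against a sliding threshold, ACC, Sen and Sep are computed at every step, and the values are then averaged over the range. The step size of 0.1, the rule that a score at or above the threshold counts as a positive prediction, and the function names are assumptions, since the text gives only the range.

```python
def threshold_profile(scores, labels, start=-1.5, stop=1.5, step=0.1):
    """Return (threshold, ACC, Sen, Sep) tuples over a threshold sweep.

    labels: +1 for positive, anything else for negative.
    Assumption: score >= threshold is a positive prediction.
    """
    rows = []
    n_steps = int(round((stop - start) / step))
    for k in range(n_steps + 1):
        t = start + k * step
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y != 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y != 1 and s < t)
        acc = (tp + tn) / len(scores) if scores else 0.0
        sen = tp / (tp + fn) if (tp + fn) else 0.0
        sep = tn / (tn + fp) if (tn + fp) else 0.0
        rows.append((round(t, 10), acc, sen, sep))
    return rows


def average_profile(rows):
    """Average ACC, Sen and Sep over all thresholds in the sweep."""
    n = len(rows)
    return tuple(sum(row[i] for row in rows) / n for i in (1, 2, 3))


if __name__ == "__main__":
    # Toy scores, not outputs of the Oxypred2 models.
    rows = threshold_profile([2.0, 1.0, -1.0, -2.0], [1, 1, -1, -1], step=0.5)
    print(average_profile(rows))
```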

To compare our new method with the existing method (Oxypred), we used a blind dataset containing 502 oxy-proteins that were not present in our datasets. The Oxypred AC and DC methods identified 96.61% (485) and 97.81% (491) of these sequences, respectively. In contrast, the Oxypred2 oxy-50 models identified 98, 99, 99, and 99%, and the oxy-90 models identified 99.20, 100, 100 and 100%, with the AC, DC, PSSM and Hybrid methods, respectively.

Discussion

Here, we presented an improved version of Oxypred for identifying oxy-proteins using various features [44, 45], applying two datasets with different similarity cutoffs. All methods recognize 100% of positive and negative sequences. The hemocyanin, hemerythrin, and leghemoglobin classes are recognized at 100% in all approaches. The oxy-50 models correctly identified 89, 98.9 and 87.5%, and the oxy-90 models 98, 99.8 and 98.4%, of erythrocruorin, hemoglobin, and myoglobin sequences, respectively.

Further, we compared performance with the existing method on the newly retrieved dataset, on which the existing method shows nearly 97% recognition. Our newly developed models, however, identified almost 99.99% and 100% with the oxy-50 and oxy-90 models, respectively. According to our prediction results, the oxy-90 models make better predictions than the oxy-50 models, and the PSSM-based approaches perform better in identifying oxy-proteins in both cases. We also found a low error rate according to the confusion matrix analysis. The present Oxypred2 method achieves better prediction than the previous method in identifying oxy-proteins. This study provides an alternative method for identifying oxy-proteins, and we hope it will be useful to the scientific community.

Limitations

  • The exponential growth and availability of newly annotated protein sequences in databases motivated us to develop an improved version.

  • Two sequence-similarity cutoffs, 90 and 50%, were used with various features for predicting oxy-proteins.

  • The oxy-90 models are making a better prediction than oxy-50 models, and our approaches are faster and achieve a better prediction performance over the existing method.

  • Finally, a web-server Oxypred2 has been developed for identifying oxygen-binding proteins.