Background

Cytochrome P450 monooxygenases (CYPs) are heme-thiolate enzymes that catalyze a broad range of reactions, including hydroxylation, epoxidation, dealkylation, and heteroatom oxygenation [1]. In humans, CYPs participate in various innate metabolic pathways, e.g., steroid hormone biosynthesis or fatty acid metabolism, and are also involved in biotransformation of xenobiotics, such as drugs and environmental pollutants [2]. Such considerable involvement in a wide array of biological processes (organ development, hormone signaling, etc.) requires a fine tuning of CYPs to function properly. Consequently, any imbalance in enzyme availability or its malfunction, e.g., due to a genetic mutation, may lead to a disease state in humans [3, 4] or may change the susceptibility of an individual to environmental insults [5].

The human genome encodes 57 P450 genes grouped into 18 mammalian families. Despite the significant sequence diversity between CYP families, all proteins display common 3D structural elements shared with CYPs from other biological kingdoms [68]. Substrate specificity is determined by the size and shape of the active site cavity; by the availability of the substrate access channel and its physico-chemical characteristics; and by amino acid composition at the substrate recognition sites [9, 10]. In addition to structural elements mentioned herein, missense mutations altering enzyme activity can occur at the heme-binding site, within the protein core, and at the protein interface involved in electron transfer from the redox partners, ferredoxin (FDX) and P450 oxidoreductase (POR) [11, 12]. However, not all missense mutations in CYPs have functional implications and result in a disease phenotype. To this end, individual genome sequencing has revealed a considerable number of variants in CYP genes (see below), and this number is expected to grow with time. On the other hand, experimental assessment of the functional implications of all identified mutations is impractical. As a result, there is a growing demand from emerging personalized medicine for computational methods for the identification of disease causing mutations among the realm of CYP variants as well as for prediction of changes in drug metabolism by mutated CYPs.

There have been a number of attempts made to compile available information about CYPs, including their polymorphisms and activity. The Cytochrome P450 Engineering Database (CYPED) is a collection of all experimentally resolved 3D structures of CYPs retrieved from the Protein Data Bank (PDB) and categorized by CYP subfamilies [13]. CYPED also provides a tool for prediction of conserved modules in CYP structures [10]. However, no polymorphism data are available in this database, and human enzymes are underrepresented. Two databases provide information about CYP variants and their corresponding activity focused on drug metabolism: SuperCYP [14] and The Human Cytochrome P450 Allele Nomenclature Database [15]. The former database is limited to the single nucleotide polymorphism (SNP) data only and lacks information regarding clinical phenotypes. The latter database contains information about only half of the human CYPs (29 of 57 genes as of May 2013). More generalized public databases can be used as a better source of CYP variants: NCBI dbSNP [16] or UniProt humsavar [17]. These two databases can also contain associated disease phenotypes, although the annotations are lagging the recent publications.

There is a variety of methods for the prediction of functional implications of missense mutations, each of which utilizes a different heuristic (see recent reviews [18, 19] as well as examples of recent methods [2022]). However, there is no prediction method available designed specifically for analysis of missense mutations in CYPs, despite the importance of these enzymes in human health and their direct clinical relevance. At the same time, models generalized for the entire human proteome may not perform well with CYPs, as these enzymes have highly variable regions – substrate recognition sites (SRSs) – that cannot be recognized as critical functional spots by evolutionary based methods. On the other hand, a fraction of residues on the surface in CYPs are involved in the transient protein-protein interaction and electron transfer from a redox partner, but these residues may not be recognized as critical by the structure-based methods that consider mutations on the surface as less influential on a protein function. To fill this gap, we have developed a new method (MutaCYP) dedicated to the prediction of deleterious effects of missense mutations specifically in CYPs. MutaCYP combines evolutionary information and predicted structural information in 5 non-redundant sequence-based features in its prediction model. Our method was compared with two representative and commonly used methods: SIFT and PolyPhen-2. SIFT uses a prediction model based primarily on evolutionary information [23], whereas PolyPhen-2 adds protein structural data to the feature space [24]. MutaCYP outperforms both methods. At the same time, raw prediction scores by MutaCYP appear to be orthogonal to those by other evaluated methods. Hence, there is a potential to improve the accuracy of classification using a meta-predictor that combines predictions from these methods. MutaCYP is publicly available at http://research.cchmc.org/MutaSense/.

Methods

Datasets

The UniProt humsavar database (release 2012_10 of 31 October 2012) was used for the training and cross-validation of the prediction model. The release contained information about 562 variants in 51 human CYPs. Proteins containing only variants without disease association were excluded. The reasoning was that these proteins were most likely not yet annotated with respect to disease phenotype, and thus might introduce noise in the training as false negative instances. For example, according to UniProt, all known missense mutations in CYP1A1 are listed as benign. However, there is a solid body of evidence that some mutations in CYP1A1 can be disease causing [25, 26]. Specifically, mutation I462V significantly increases catalytic activity of CYP1A1 and is associated with estrogen-related cancers and other physiological disorders [27, 28]. After applying this exclusion criterion, 15 CYPs remained with 270 variants that were used to generate vectors, including 73 benign (true negative) and 197 deleterious (true positive) mutations. This training dataset was named TS270 (Additional file 1: Table S1). All excluded CYPs and their variants (36 and 292, respectively) were grouped as a separate blind set (named BS292; Additional file 1: Table S2), where association with a disease is not entirely clear, to see how predictions overlap and correlate between all validated methods.

The control dataset was derived from recently published literature on the new missense mutations identified in different CYPs with disease association. Inclusion criterion was the absence of the same mutation in the training set. A literature search yielded 30 new variants (29 deleterious and 1 neutral) for 4 human CYPs: CYP7B1 [29, 30], CYP21A2 [31, 32], CYP11B1 [33], and CYP27B1 [3436]. The control dataset was named CS30 (Table 1). All datasets are available for download from the home page of the method.

Table 1 Human CYP variants used for the control set CS30

Feature selection

With the goal of minimizing the feature space used for a prediction model, all explored sequence-based characteristics were evaluated using two inclusion criteria: (i) a feature displays a maximally possible discriminatory power quantified by F-score, which is defined below; (ii) a feature needs to be non-redundant to other already included features, with redundancy measured by Pearson correlation coefficient (r). F-score is defined as follows and was previously introduced for feature selection for prediction of protein-protein interaction sites [37].

F= x ¯ n x ¯ d σ n + σ d
(1)

where x ¯ n and x ¯ d are means of the feature over neutral and deleterious mutations, respectively; σ n and σ d are the corresponding standard deviations.

Evaluated features include physico-chemical properties of amino acids and their changes due to mutation, position specific similarity scores and Shannon entropy derived from multiple sequence alignment (MSA), predicted relative solvent accessibility (RSA). MSA was obtained using PSI-BLAST with three iterations against the NCBI nr database, following our previous protocol [38]. The complete list of the explored features along with their descriptions is available in Additional file 1: Table S3.

Feature selection is summarized in Additional file 1: Table S4. The best performance was observed with the set of features that have F-score ≥ 0.4 and r < 0.8. The final feature space used in subsequent evaluations and in MutaCYP consists of 5 features: absolute difference between similarity scores of wild type amino acid and mutation for a given position (Abs_dSS); absolute difference between sizes of wild type amino acid and mutation weighted by the difference of the corresponding similarity scores (ss_Abs_dSize); Z-score for Shannon entropy at a given position based on a window of 21 neighboring amino acids (zsEntropy21); predicted RSA (predRSA); and variance of predicted RSA for the window of 21 neighboring amino acids (varPredRSA21). Figure 1 displays the distributions of these five features for benign and disease causing mutations; Table 2 contains the corresponding F-scores and pairwise correlation coefficients.

Figure 1
figure 1

Distribution of the features used in the final prediction model over benign and deleterious mutations. A. Abs_dSS – absolute difference between similarity scores of wild type amino acid and mutation for a given position. B. ss_Abs_dSize – absolute difference between sizes of wild type amino acid and mutation weighted by the difference of the corresponding similarity scores. C. zsEntropy21 – Z-score for Shannon entropy at a given position based on a window of 21 neighboring amino acids. D. predRSA – predicted RSA. E. varPredRSA21 – variance of predicted RSA for the window of 21 neighboring amino acids. Whiskers indicate minimal and maximal values of a given feature.

Table 2 Features passed the inclusion criteria and used for the final prediction model

Models and validation

Two models, linear and non-linear, were used in this work. The former model was based on linear discriminant analysis (LDA) as implemented in the TOOLDIAG package [39]. The LDA-based models were employed for feature selection. The non-linear model is based on neural networks (NN) as implemented in the SNNS package [40]. Multiple NN architectures were evaluated. All NNs were trained using standard backpropagation (StdBP) and resilient backpropagation (Rprop) learning algorithms. Comparative analysis of the performance of different NNs can be found in Additional file 1: Table S5. The best performing NN appeared to be a feed forward network with 5 input nodes, 2 hidden layers with 10 and 5 hidden nodes, and 2 output nodes, trained using the Rprop learning algorithm.

All models herein were evaluated using 5-fold cross-validation on the training set. In case of NN-based models, additional 20% of vectors were withdrawn from each training subset to be used as a validation subset for choosing the best performing NN in each fold, which is then evaluated using a corresponding test subset. The final model employed in MutaCYP is a single NN that showed the best generalization from the validation to test subsets (Additional file 1: Table S5). A flowchart depicting the protocol for building a prediction model used in MutaCYP can be found in supplementary materials (Additional file 2: Figure S1).

Accuracy measures

The following measures of prediction accuracy were used in this work: the two-class classification accuracy (Q2), recall (R), precision (P), and Matthews correlation coefficient (MCC).

Q 2 = TP + TN TP + TN + FP + FN 100%
(2)
R= TP TP + FN 100%
(3)
P= TP TP + FP 100%
(4)
MCC= TP TN FP FN TP + FN TP + FP TN + FP TN + FN
(5)

where TP are true positives (deleterious mutations), TN – true negatives (benign mutations), FP – false positives, and FN – false negatives. MCC was used as the objective function to be maximized during feature selection, and as a measure of generalization during selection of NN for the final prediction model (Additional file 1: Tables S4 and S5).

Results and discussion

Feature selection

Missense mutations in CYPs may cause a disease phenotype due to many reasons including impediment of heme binding; misfolding or destabilizing of the protein; a change in the binding affinity to a substrate leading to reduced or increased enzymatic activity; alteration of the substrate/product turnaround in and out of the active site cavity; hindrance or abolishment of the binding to a redox partner (affecting the electron transport rate); and an altered ability to reside at the membrane such that the protein cannot properly localize within the cell.

Structure-based features, while being the most informative, cannot be used to full advantage, as most of the human CYPs are not structurally resolved. For example, it is difficult to map SRSs using sequence homology because they are highly variable regions. Moreover, some enzymes have a well-defined substrate access channel, whereas others do not. The protein-protein interaction interface with a redox partner is also not defined for human CYPs and cannot be mapped directly from remote homologs. Hence, a fraction of missense mutations on the surface cannot be discarded from the pool of potential effectors, as some prediction methods are inclined to do. The model for the prediction method presented herein utilizes sequence-based characteristics only.

Since many considered features are evolutionary based, the influence of MSA quality on these features was explored. In addition to a full sized NCBI nr database, two reduced nr versions were used to generate MSA, where redundant sequences with 90% and 70% identity were removed. Sequence clustering was performed using CD-HIT [41]. Changes in discriminatory power (F-scores) of the evolutionary based features depending on the sequence database used are summarized in Additional file 1: Table S3. Features based on MSA derived from the nr database, reduced by removing redundant sequences with over 90% identity (nr90), appear to provide the best distinction between benign and deleterious mutations (Additional file 1: Table S3, F b column). Hence, the following results and the final prediction model are based on the nr90 sequence database.

As expected, evolutionary-based features indicate that disease causing mutations occur in CYPs primarily at conserved sites and have unfavorable similarity scores for mutation amino acids. In this respect, the distribution of Abs_dSS displays the tendency for deleterious mutations to have a wider difference in similarity scores between the mutation and a wild type amino acid (Figure 1A, F = 0.73). Similar considerations were used as a basis for the SIFT and PolyPhen-2 methods. The former computes the probability of the occurrence of a given mutation at a given position based on MSA [23], whereas the latter method uses the dSS feature in its prediction model (see supplementary to [24]). Concordantly, ss_Abs_dSize shows a larger weighted change in size of an amino acid for deleterious mutations (Figure 1B, F = 0.61). Furthermore, Figure 1C shows that entropy for positions with deleterious mutations is shifted from the average entropy across neighboring residues towards negative direction, indicating a higher conservation (zsEntropy21, F = 0.49). Of note, entropy itself for deleterious mutations is closer to 0 than for benign mutations and has high F-score equal to 0.66. However, it is highly correlated with absolute difference between similarity scores, r(Abs_dSS, Entropy) = −0.82, and hence it was removed from the final feature space as a redundant feature (Additional file 1: Table S4).

More interestingly, predRSA appears among the strongest sequence-based characteristics capturing the disease causing mutations (F = 0.47). As follows from the name, RSA measures the extent of surface exposure (or conversely, a burial state) for a given residue in a given protein conformation normalized to a maximal possible exposure for a given kind of amino acid. The use of predicted RSA in the prediction of deleterious mutations is not entirely novel to this study. Two other studies reported prediction models that include predicted RSA [42, 43]. However, those models used a 2- or 3-state definition of solvent accessibility (e.g., buried, half-buried, and exposed), which limits its applicability. In this regard, we previously developed our own method for RSA prediction (SABLE) and showed that this structural characteristic is more meaningful and useful when considered as a continuous value [38]. Furthermore, with the overall accurate prediction of RSA by SABLE, we showed that the method is prone to over-prediction in terms of burial state for residues that are located in trans-membrane regions [44], at protein-protein interaction interfaces [37], and within structurally restrained regions [45]. These are exactly the places where one would expect deleterious missense mutations most likely to occur. Therefore, a certain bias in predicted RSA towards the burial state is expected to correlate with such mutations (Figure 1D). To this end, variance in predicted RSA (varPredRSA21) is a complementary, yet orthogonal, feature to predRSA (F = 0.45, r = 0.42). It describes local sequence environment in terms of surface exposure and appears to be lower for positions with deleterious mutations (Figure 1E), indicating a homogeneous environment (for example, most of the neighboring residues are predicted to be buried).

Evaluation of the model

Table 3 contains the summary of evaluations of the prediction models using the training set TS270. First, we assessed which of our models performs better, linear (LDA) or non-linear (NN). In 5-fold cross validation (Table 3; lines 1–2), the linear model appears to perform better, MCC = 0.54, compared to the NN-based model (MCC = 0.46). Second, we probed whether a consensus based model can improve the performance. Predictions by the five individual models derived from cross validation were combined using simple majority voting. NN-cons model shows the improvement over 5-fold cross validation, but both LDA-cons and NN-cons perform similarly yielding MCC = 0.53 (Table 3; lines 3–4). Next, in the efforts of simplifying the final prediction model, a single NN was chosen from cross validation that showed the best generalization from the validation to test subsets (see Additional file 1: Table S5 and Methods for details). Line 5 of Table 3 (highlighted with bold face) shows performance of the final model used for MutaCYP, which appears to be the best across all other models evaluated (MCC = 0.70).

Table 3 Performance of the prediction models on the training set TS270

Comparison with other methods

First, MutaCYP was compared with the SIFT and PolyPhen-2 methods using the TS270 set (Table 3, last three lines). Table 4 presents a confusion table for binary predictions and correlation between raw scores. Figure 2 presents ROC curves for all methods along with AUC values. Collectively, MutaCYP outperforms the other methods by all accuracy measures showing MCC = 0.70, Q2 = 88.52%, R = 93.40%, P = 91.09%, and AUC = 0.909. The SIFT method appears to perform the worst (MCC = 0.49, AUC = 0.824), and it does not provide predictions for CYP21A2 (UniProt ID: P08686; Ensembl ID: ENSP00000403721), probably due to the lack of MSA for this CYP in its pre-computed database. This excluded 63 mutations from evaluation of SIFT.

Table 4 Performance of the evaluated methods on the training (TS270) and control (CS30) sets
Figure 2
figure 2

ROC curves for predictions by the evaluated methods on the TS270 dataset.

Next, we compared the methods using the CS30 dataset, where 29 mutations are deleterious and not included in the UniProt humsavar database (see Methods for details). Again, SIFT did not provide prediction for CYP21A2, thus excluding 8 deleterious and 1 benign mutations from CS30. MutaCYP correctly predicted all mutations in CS30 (Table 4). PolyPhen-2 trained on the HumVar dataset correctly predicted the benign mutation (CYP21A2: A265V) and misclassified 3 deleterious mutations as benign (CYP11B1: M88I,T401A; CYP27B1:P143L). PolyPhen-2 trained on the HumDiv dataset wrongly predicted the benign mutation as deleterious, but confused only one disease causing mutation (CYP11B1:M88I). Of 21 deleterious mutations, SIFT predicted 5 mutations to be benign (CYP7B1:T297A; CYP11B1:M88I,A165D; CYP27B1:G57V,G102E), thus showing the highest confusion rate. Of note, all three compared methods predicted M88I in CYP11B1 as benign, whereas only MutaCYP correctly assigned it as deleterious.

Additional comparison of the methods has been conducted using the BS292 set (Table 5), where all mutations are assigned by UniProt as benign. Seven mutations were missing in SIFT prediction for BS292. All evaluated methods predicted considerable fraction of these mutations to be deleterious (61%, 42%, 45%, and 44% by MutaCYP, PolyPhen-2/HumVar, PolyPhen-2/HumDiv, and SIFT, respectively). These results further support our hypothesis that some missense mutations in BS292 are not fully annotated in the UniProt database. This goes in line with the previously published study, where similar concerns were raised about quality of UniProt annotations for missense mutations in cancer genes [46].

Table 5 Performance of the evaluated methods on the blind set (BS292)

Finally, we measured the correlation in predictions between the four methods by looking at the raw prediction scores (Tables 4 and 5). For TS270, MutaCYP yields moderate correlation with other methods, with Pearson correlation coefficients ranging between 0.58 and 0.69. Two PolyPhen-2 methods have the highest correlation reaching r = 0.96, and moderate correlation with prediction scores by SIFT (r = 0.66-0.67). In CS30 predictions, MutaCYP has low to moderate correlation with the other methods (r = 0.12-0.38). Despite the similar performance, MutaCYP has only r = 0.38 with PolyPhen-2 (HumVar). Again, predictions by the two PolyPhen-2 methods have the highest correlation (r = 0.94), and they both moderately correlate with prediction scores by SIFT (r = 0.58-0.67). For BS292, MutaCYP shows moderate correlation with other methods (r = 0.34-0.48), whereas PolyPhen-2 methods mutually correlate with r = 0.96, and moderately correlate with SIFT predictions (r = 0.47-0.48). Figure 3 illustrates the overlap between predictions by the evaluated methods using Venn diagrams. 56%, 47%, and 56% of missense mutations in the TS270, CS30, BS292 datasets, respectively, were unanimously classified by all methods.

Figure 3
figure 3

Overlap of predictions by the evaluated methods in TS270, CS30, and BS292 datasets.

Since MutaCYP appears orthogonal to other evaluated methods by showing only moderate correlation in raw prediction scores, there is possibility in improvement of overall prediction by combining two or more methods in a consensus-based classifier. Additional file 1: Table S6 shows the results of predictions by combinations of the considered methods using the simple majority voting and union approaches for consensus. While most of combinations do not show any improvement compared to MutaCYP alone, a consensus of MutaCYP and PolyPhen-2 trained on the HumVar data does slightly increase accuracy yielding MCC = 0.71. Perhaps, further improvement may be achieved by developing a consensus-based model using machine learning techniques, but it is beyond the scope of this study.

Conclusions

Cytochrome P450 monooxygenases constitute a large superfamily and are represented by 57 genes in the human genome. CYPs play important roles in human health via endogenous functions and interaction with environment. A special attention is required in the analysis of missense mutations in these genes to understand their role in the disease development and individual susceptibility to environmental cues. The new method called MutaCYP was developed along with the entailing web-server to address the need in the tailored interpretation of mutations in human CYPs. With five sequence based features, MutaCYP outperforms SIFT and PolyPhen-2. Predictions by the new method appear to be orthogonal to predictions by the evaluated methods and hence can be included in a meta-predictor to further improve the accuracy of classification. The large scale analysis of missense mutations in human CYPs using 4 different prediction methods supports the notion that not all mutations in the UniProt humsavar database have reliable annotations as neutral and must be carefully used in the training and validation protocols.