Computational pipeline to identify and characterize functional mutations in ornithine transcarbamylase deficiency

Ornithine transcarbamylase (OTC) (E.C. 2.1.3.3) is one of the enzymes in the urea cycle, which involves in a sequence of reactions in the liver cells. During protein assimilation in our body surplus nitrogen is made, this open nitrogen is altered into urea and expelled out of the body by kidneys, in this cycle OTC helps in the conversion of free toxic nitrogen into urea. Ornithine transcarbamylase deficiency (OTCD: OMIM#311250) is triggered by mutation in this OTC gene. To date more than 200 mutations have been noted. Mutation in OTC gene indicates alteration in enzyme production, which upsets the ability to carry out the chemical reaction. The computational analysis was initiated to identify the deleterious nsSNPs in OTC gene in causing OTCD using five different computational tools such as SIFT, PolyPhen 2, I-Mutant 3, SNPs&Go, and PhD-SNP. Studies on the molecular basis of OTC gene and OTCD have been done partially till date. Hence, in silico categorization of functional SNPs in OTC gene can provide valuable insight in near future in the diagnosis and treatment of OTCD.


Introduction
Ornithine transcarbamylase (OTC) catalyzes the formation of citrulline from carbamoyl phosphate and L-ornithine in the urea cycle, deleterious mutations in the human OTC gene disrupts the formation and produces clinical hyperammonemia, which can also lead to encephalopathy with subsequent neurological symptoms or even death. Ornithine transcarbamylase deficiency (OTCD) is the most common inborn error of urea cycle showing X-linked inheritance, which occurs at a predictable frequency of 1 in 14,000 births. Affected individuals show elevated levels of ammonia in their plasma and amplified urinary flow of orotic acid (Lopes-Marques et al. 2012). Males with OTCD show neonatal ammonia intoxication with severe or fatal neurological damage. Those with limited enzymatic OTCD may perhaps have a normal life span, but are at the peak intended for stress-induced hyperammonemic emergencies and incremental neurological damage. Females are carriers who might be asymptomatic, but often show some amount of protein intolerance (Maddalena et al. 1988). The human OTC gene is found on the short arm of the X chromosome with its cytogenetic location being Xp21.1. The size of the gene is 73 kb with an open reading frame of 1,062 nucleotides and holds 10 exons interjected by 9 introns of highly variable size. The OTC gene is expressed entirely in the liver and small intestinal mucosa. It translates a precursor OTC protein containing 354 amino acids and the amino end contains a spearhead sequence of 32 amino acids, which is cleaved in two steps upon integration into the mitochondrial matrix (Ogino et al. 2007). A polymorphism is a germline variation in the nucleotide base of the DNA molecule. As a rule of thumb inheritable variation is termed, a polymorphism if it is present at an allele frequency greater than 1 % in the general population, otherwise, at lower frequencies, it is considered as germline mutation (Strachan and Read 1996). Genetic polymorphisms are present throughout the genome of human. The most common type of polymorphism is single nucleotide polymorphism (SNP) that can occur in the frequency of about 1 out of every 300 nucleotide base pairs, and there are probably more than 10 million SNPs in the human population (The international HapMap and Consortium 2006). Polymorphisms can occur in both coding and non-coding region of the genes and may sometimes, particularly those within exons, have an impact on the structure and function of the protein coded by a particular gene, especially in those cases when the polymorphism leads to an amino acid substitution in evolutionarily conserved functional region of the protein.
A polymorphism that takes to an amino acid substitution and is present within an active site of an enzyme, at a substrate-binding site, a DNA-binding site or in other areas of the protein domains may affect the function of the encoded protein. This is particularly correct if the substituted amino acid has a different 3D structure or electrical charge than the wild-type amino acid, as this will alter the conformation or affinity of the enzyme, and make it nonfunctional, or more or less efficient than the wild-type protein (AliOsman et al. 1997;Hadi et al. 2000;Matullo et al. 2001;Pemble et al. 1994).
The loss of stability of proteins is one of the foremost causes of disease. As the proteins are only marginally stable, even small effects on stability alter the thermodynamic equilibrium to make the folded state unstable. Mutational data show that mutations often, if not in the majority of cases, cause significant changes to protein stability which are often on the order of magnitude of the absolute stability of the protein (Guerois et al. 2002). Lowered stability leads to a reduction in a protein's effective concentration, which in turn causes deficiencies in its ability to perform its biochemical function (Pakula et al. 1986).
Mutations in this OTC gene are the main reason for OTCD. Deleterious non-synonymous single nucleotide polymorphism (nsSNP) analysis for the OTC gene has not been projected computationally until now, while they are the center for new investigators. Therefore, in this work, the computational methods namely SIFT, PolyPhen 2, I-Mutant 3, SNPs&Go, and PhD-SNP were used to identify the deleterious nsSNPs that are expected to be affecting the function and structure of the OTC protein.

Dataset used for SNP annotation
Human OTC gene information data were collected from Online Mendelian Inheritance in Man (OMIM) (Amberger et al. 2009) and Entrez Gene on National Centre for Biological Information (NCBI). The SNP information of CBS was retrieved from the NCBI dbSNP (Sherry et al. 2001), and SWISS-Prot databases (Amos and Rolf 1996). Protein 3D structure was obtained from protein data bank (PDB) (Berman et al. 2000).

PolyPhen 2
PolyPhen2 (Polymorphism Phenotyping) predicts the functional effect of amino acid changes by considering evolutionary conservation, the physico-chemical differences, and the proximity of the substitution to predicted functional domains and/or structural features. A mutation is classified as ''probably damaging'' if the probabilistic score is above 0.85-1, mutation is classified as ''possibly damaging'' if the probabilistic score is above 0.15-0.84, and the remaining mutations are classified as benign (Adzhubei et al. 2010).

I-Mutant 3
SVM-based method I-Mutant 3 predicts the protein stability changes upon a single point mutation. It provides free energy change (DDG), which is calculated from the unfolding Gibbs free energy change of the mutated protein minus the unfolding free energy value of the native protein (Kcal/mol). It classifies the predictions in three classes: If DDG is \-0.5 = large decrease of stability, If DDG is between -0.5 and 0.5 = neutral stability and If DDG is [0.5 = large increase of stability (Capriotti et al. 2005).

SNPs&GO
It is a method based on SVMs that predict disease-associated mutations from protein sequence, evolutionary information and functions as encoded in the gene ontology terms. Moreover, it is a server for the predicting single point mutations, which cause disease in humans (Calabrese et al. 2009).

PhD-SNP
PhD-SNP uses SVM-Sequence method and SVM profile to classify the mutation into disease related and neutral polymorphisms. It predicts if the given nsSNP has pathological effect based on the local sequence environment of the mutation. It uses the most accurate mode that enables both sequence and evolutionary profiles (Capriotti et al. 2006).

Structural analysis
To evaluate the structural stability of native and mutant, protein structure analysis was performed. We used the web resource dbSNP to identify the protein coded by OTC. We also confirmed the mutation positions and the mutation residues from this server. These mutation residues and their corresponding positions were in complete agreement with the results obtained from the in silico prediction methods SIFT, PolyPhen 2, I-Mutant 3, SNPs&GO and PhD-SNP. The mutation was performed using SWISS-PDB viewer (Guex and Peitsch 1997), and energy minimization for 3D structures was performed by NOMAD-Ref server (Lindahl et al. 2006). This server uses Gromacs as default force field for energy minimization based on the methods of steepest descent, conjugate gradient and L-BFGS methods. Conjugate gradient method was used for optimizing the 3D structures. Deviation between the two structures was evaluated by their Root Mean Square Deviation (RMSD) values.

Results
A total of about 195 SNPs were collected and their deleterious natures were analyzed by various computational methods.
Analysis of deleterious SNPs using evolutionary-based prediction methods SIFT algorithm calculates whether an amino acid replacement may have an impact on protein function by aligning similar proteins and calculating a score which tells the evolutionary conservation status of the amino acid of our interest. SIFT scores were obtained for 195 SNPs. Analysis of deleterious SNPs using structure-based prediction methods The influences of nsSNPs in protein function were tested using structure-based predictors by applying it to three different methods. The structural levels of changes of 195 nsSNPs were determined by PolyPhen 2. To provide an outline of the distribution of PolyPhen 2 scores, the scores are distributed into three groups. PolyPhen 2 scores falling between 0.85 and 1 are expected to be ''probably damaging'' to protein structure and function. 157 (80 %) of the nsSNPs were found to have scores in the above-mentioned category. An additional 19 (9.7 %) of the variants exhibited PolyPhen 2 scores of 0.2-0.84, indicative of variants that are ''possibly damaging'' to protein function, and the remaining 17 (8.7 %) nsSNPs that scored less than 0.02 were designated as ''benign''. SNPs&GO makes use of sequence and evolutionary information to predict whether a mutation is disease related or not by developing the protein functional annotation. The protein sequences with corresponding UniProt accession numbers were submitted along with their corresponding mutational position, wild-type and mutant-type residue as input to the server. 98 % of the nsSNPs were designated as ''disease''. These mutants are found to be disease causing. PhD-SNP predicts the given nsSNPs have pathological effects based on the local sequence environment of the mutation. It classifies the SNPs into disease or neutral based on the most accurate mode that uses both sequence and evolutionary profiles. It showed 64 % of nsSNPs were likely to cause disease on mutation.

Prediction of stability changes
Mutated proteins involved in diseases show a stability change. Predicting the protein stability upon mutation is necessary for understanding structure function relationship of protein. Generally, the stability of a protein is represented by the change in the Gibbs free energy upon folding (DG), where an increasingly negative number represents greater stability. Single amino acid substitution in a protein sequence can result in a significant change in the protein's stability (DDG), where a positive DDG represents a destabilizing mutation and a negative value represents a stabilizing mutation. All the 195 nsSNPs submitted to pathogenic prediction tools were also subjected to protein stability analysis by I-Mutant 3.0. It gave an estimation of 107 nsSNPs (54 %) caused decreased stability, 48 SNPs (24 %) were neutral to the mutation, and 39 SNPs (20 %) increased the stability of protein after mutation. Out of 195 nsSNPs, 92 nsSNPs (47 %) were predicted to be positive by SIFT, PolyPhen 2, I-Mutant 3, SNPs&Go, and PhD-SNP (Table 1).

Structural analysis
According to the computational prediction in OTC gene, structural analysis was performed for the five highly deleterious variants by modeling mutant structures using  1OTH). An energy minimization study gives the information about the protein structure stability. We checked the total energy for native-and mutant-type structures. In OTC gene, mutation occurred for the native protein in 'A' chain of protein structure at position D126G, R141Q, A174P, T178M and G195R. It can be seen that the total energy value and RMSD of native-type and mutant-modeled structures (D126G, A174P, and G195R) were found to be higher ( Table 2). The mutations for 1OTH at their corresponding positions were performed by SWISS-PDB viewer independently to achieve modeled structures. Then, energy minimizations were performed by NOMAD-Ref server for the native-type protein 1OTH and the mutant-type structures. The RMSD values between the native type (1OTH) and the mutant D126G is 2.01 Å , between the native type and the mutant A174P is 2.82 Å , and between the native type and the mutant G195R is 2.82 Å , respectively. The deviation between the two structures is evaluated by their RMSD values, which could affect the stability and functional activity. The RMSD values of all the mutant structures were all alike. Higher the RMSD value more will be the deviation between native-and mutant-type structures and which in turn changes their functional activity. Superimposition of native with the mutant protein D126G, R141Q, A174P, T178M and G195R of OTC gene is shown in (Fig. 1a-e). The total energy for the native and mutant type structures were found to be -25480.939, -24899.660, -25068.101, -24881.020, -24969.936 and -24608.215 kcal/mol respectively ( Table 3).

Analysis of local environment changes
Within the range of 4 Å from the mutational point, surrounding amino acid changes were analyzed for native and mutant protein structures. It was observed through PyMOL (DeLano 2002). Figure 2 shows the substitution of hydrophilic residue aspartic acid to hydrophobic residue glycine at position 126, which leads to hydrophobic change at the core of the protein that could result in the destabilization of the gamma turns. The drift in hydrophilic to hydrophobic property can result in the gain of one amino acid LEU 131 in mutant structure. Figure 3 illustrates the substitution of the hydrophilic residue arginine with another hydrophilic residue glutamine at position 141, which leads to structural modification at the core region of the protein due to the size of the substituted amino acid, and that could result in affecting the strand portion. The changes in the amino acid size results in loss of four amino acids ARG330, HIS268, LEU139, and THR93 in mutant R141Q structure. Substitution of hydrophobic residue alanine with another hydrophobic residue proline and changes in the surrounding amino acids are shown in Fig. 4. Since the size of the substituted amino acid has the same size of the native residue, these changes were not affected the surrounding amino acids in A174 P-mutant structure. Figure 5 shows the substitution of non-polar hydrophobic amino acid glycine with polar hydrophilic larger amino acid arginine at position 195 of OTC protein. Substitution of small amino acid glycine with large amino acid arginine leads to gain of seven SER267, THR264, ILE200, ASP263, TRP265, ASN198, and LEU252 amino acids in the surrounding region of mutant structure. This change may affect the gamma turn of the native protein.
Substitution of polar hydrophobic amino acid threonine at position 178 with non-polar hydrophobic amino acid methionine is shown in Fig. 6. This substitution leads to gain of one amino acid in the mutant structure and this change may affect the helix region of the native OTC protein.  Secondary structural changes analysis The number of secondary structure elements such as Beta sheets, Beta-Alpha Beta, Strands, Helices, Helix-Helix Interacs, Beta Turns, and Gamma Turns was calculated for both the native and mutant models (Table 4). It has to note that the observed numbers of secondary structural elements are equal in both native and mutant models except the Helix-Helix Interacs and Beta Turns. There was a slight decrease in the number of beta turns in mutant models D126G, R141Q, A174P, T178M, and G195R as 15, 12, 15, 12, and 15, respectively. The number of beta turn was increased by one in three mutant models R141Q, A174P, and T178M. These secondary structural element changes lead to changes in the physiochemical properties of the mutant structure (Table 5) and it may affect the protein stability and conformation.

Discussion
Last decade has witnessed the accelerated expansion of information regarding the genomic variants especially SNPs in public databases as a result of improved second generation sequencing technologies. After polymorphism information has become abundant in public databases, many groups started to develop in silico tools that would computationally calculate the properties of these polymorphisms, particularly trying to extrapolate the effect of  polymorphism that has on the phenotype. If dataset on the phenotypic impact is unknown (owing to the insufficiency of clinical data or experimental) or not specified, most of the tools set out to identify whether a polymorphism is detrimental or not. Anyhow, in order for the identification, to be accurate, information had to be accumulated on the features distinguishing neutral from deleterious polymorphisms; many tools and algorithms that support large-scale analyses of SNPs (In particular nsSNPs). Various computational methods have been developed for predicting the significant missense mutations based on sequence and structural methods. With respect to the information utilized by the prediction, existing methods can be roughly grouped into three categories: 'sequence-based', 'structure-based' and 'sequence and structure-based', respectively. Sequence-based methods can be subcategorized into sequence homology-based and single sequence-based methods. Sequence homology-based method methods in  . Sequence homology-based tools are derived based on the premise that essential amino acids are conserved in the protein family. Hence, changes at well-conserved positions tend to be predicted as deleterious. This probabilistic method provides information about conserved sites in evolution that are often structurally or functionally important and distinguishes between missense mutations involved in disease and those that are functionally neutral. For sequence homology-based methods, the prediction accuracy depends heavily on the availability of enough homologs in protein databases. Saunders and Baker (2002) showed that the prediction accuracy decreased significantly when fewer than 5-10 homologous sequences are available. An ideal alignment should be composed of a diverse set of orthologous sequences rather than paralogs. Structure-based methods make predictions based on structural information, especially that of amino acid side-chain conformation, over  packing and residue-residue contacts (Gonzalez Diaz et al. 2005). The substitution of a wild-type residue may lead to altered chemical and physical properties, thus causing structural arrangements. The third method category combines information on the sequence features, the structural parameters and contacts to characterize the substitution. The incorporation of structural data greatly improves the quality of the multiple sequence alignment and the accuracy of prediction. This is well illustrated by PolyPhen (Ramensky et al. 2002), a multiple sequence alignment server that aligns sequences using structural information. It may outperform the single sequence-based program SIFT (Ng and Henikoff 2003) in predicting the effect of amino acid mutations. In addition to PolyPhen, diverse Webbased programs are used to predict mutation effects based on homology and three-dimensional structural models, e.g., PROMALS3D (Pei et al. 2008 (Zhou and Zhou 2005). The user only needs to provide sequences, the server runs BLAST to identify close homologues of the sequences within the PDB database. Study of the molecular basis of diseases using experimental methods is often labor intensive, and time consuming, especially in cases where there are several missense mutations causing the disease. These studies are difficult to mount on a scale that may be required for characterizing the genetic variants and at times these results might not always reflect the in vivo genotype function in humans. In contrast, precise and useful information about the effects of mutations on protein structure and function can be readily obtained by in silico methods. Our study gains significance by predicting the possible deleterious SNPs in OTC gene, so that the number of SNPs screened for association with diseases can be reduced to those that are most likely to alter gene function. All the above methods defined here follow a similar technique in which each SNP is first labeled with the properties related to damage it may cause on protein structure and function.  The resulting feature vector is then used to determine whether a single residue substitution has any effect on protein function or not. Considering SNPs based on the amino acid properties are generally reflected to be an important phenomenon in defining the protein folding, stability, and its function. The results from this paper signify the impact of mutations in OTC gene in causing OTCD. Further, studies possibly will help in uncluttered nature of OTCD. It is hoped that the results obtained from this study would pave the way by providing useful information to the researchers, and can play an important role in bridging the gap between biologists and bioinformaticians. of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information.