Background

Myristoylation is an irreversible, post-translational protein modification found in fungi, higher eukaryotes, and viruses, in which myristic acid is covalently attached via an amide bond to the alpha-amino group of an N-terminal glycine residue. The modification is catalyzed by the enzyme N-myristoyltransferase (EC 2.3.1.97), and occurs most commonly on glycine residues exposed during co-translational N-terminal methionine removal. Myristoylation also occurs post-translationally, for example when previously internal glycine residues become exposed by caspase cleavage during apoptosis [1].

Myristoylation can influence the conformational stability of individual proteins, as well as their ability to interact with membranes or the hydrophobic domains of other proteins [24]. If an attached myristic acid is exposed on a protein's exterior surface, it can loosely tether the modified protein to the plasma membrane, endoplasmic reticulum, mitochondrion, or other membrane system, providing enhanced opportunities to interact with other proteins localized nearby. Accessibility of the myristoyl moiety may be altered by ligand binding [5], changes in pH [6] phosphorylation [7], or proteolysis [8], reversing membrane localization.

Myristoylation plays a critical role in many cellular pathways, especially in the areas of signal transduction, apoptosis, and extracellular export of proteins. Animal proteins known to be myristoylated include protein kinases and phosphatases, calcium binding EF-hand containing proteins, guanine nucleotide-binding proteins, and membrane- and cytoskeletally-bound structural proteins [3]. Plant myristoylation has been directly measured in fewer cases, but is confirmed for proteins involved in growth regulation, disease resistance, salt tolerance and endocytosis. Examples of myristoylated plant proteins include a Rab GTPase required for endosomal sterol transport [9, 10], a calcium binding protein required for salt tolerance [11], and calcium-dependent protein kinases from Arabidopsis thaliana, Oryza sativa, Lycopersicon esculentum, Cucurbita pepo, and Solanum tuberosum [1216]. Additional plant functional families containing myristoylation sites have been identified biochemically from in vitro peptide studies [17, 18]. These include guanine nucleotide binding proteins, innate immunity proteins, thioredoxins, components of the protein degradation pathway, transcription factors, and fructose-2,6-bisphosphatase, a regulatory enzyme of glycolysis.

The ability to reliably predict myristoylation from sequence data alone is extremely useful in determining subcellular localization and function in cases where direct biochemical measurements are unavailable. Currently, four different prediction algorithms are available, but all have drawbacks which make them sub-optimal for predicting myristoylation in plant proteins.

The PS00008 myristoylation signature provided by PROSITE [19] was the first publicly available prediction algorithm. This method is still widely used, despite the fact that the myristoylation signature has not been updated since 1989, and is known to give a large number of false positive, as well as false negative predictions. One reason for the inaccuracy of these predictions is the small number of myristoylated sequences used to construct the signature. A second problem is that the amino acid choices available at each position are quite broad; as a result, only three of the six positions described are actually restrictive. Finally, more recent information indicates that amino acids downstream from the initial six included in the signature can also influence myristoylation site suitability [20].

A number of the PROSITE signature's deficiencies have been addressed by the "NMT Predictor" program [2023]. This program distinguishes between myristoylation sites for fungi and higher eukaryotes, which have been shown to have similar, but distinct specificities. The length of the prediction motif has been extended from 6 to 17 residues, and the number of higher eukaryotic sequences used for amino acid profile training expanded to 389. To improve specificity, structural data on the binding pockets of N-myristoyltransferases from fungal and mammalian species have been incorporated via a series of heuristic adjustment factors, which are subtracted from position specific scoring matrix (PSSM) results to obtain the final scores.

While these innovations improve accuracy for fungal and animal sequences, myristoylation prediction for plant sequences is still problematic. The scoring system recommended by Maurer-Stroh et al for the NMT Predictor [20] categorizes a large number of plant sequences (including some where myristoylation has been biochemically verified) into an ambiguous "twilight zone", where the algorithm is unable to distinguish positives from negatives. Perhaps this result is not surprising, since only 9 out of 389 training sequences and none of the enzymes used to analyze structural properties were derived from plants.

Recent work measuring activity of cloned Arabidopsis and yeast N-myristoyltransferases against synthetic peptide substrates has provided the opportunity to construct an improved, plant-specific myristoylation prediction algorithm. In this work, Boisson, Giglione, and Meinnel propose a model that attempts to correct plant-specific deficiencies in the NMT Predictor [17]. Surprisingly, their algorithm changes only the position specific heuristic adjustment factors from the NMT Predictor model, leaving the original, animal-biased amino acid PSSM intact. Although this strategy increases scores for biochemically verified plant positives, it also boosts scores for examples where myristoylation is highly unlikely, and the problem of many proteins giving ambiguous results remains unsolved.

Most recently, yet another myristoylation prediction method has been developed using the same positive training set as the NMT and BGM algorithms. The method proposed by Bologna et al [24, 25] uses average responses from an ensemble of 25 neural networks to model the data. Although this system separates positive from negative examples by a much wider margin than the NMT or BGM algorithms, it also assigns very high confidence levels to false negative misclassifications of plant sequences biochemically proven to be myristoylated, for example calcium dependent protein kinases from Cucurbita pepo [15] and Arabidopsis thaliana [12].

The current work describes the construction of a new, probabilistic model for myristoylation sites, based exclusively on plant-specific training examples. Plant proteins from 22 different species were selected based on four criteria: direct evidence for myristoylation, activity of peptide sequences as substrates for plant N-myristoyltransferase enzymes, subcellular localization, and N-terminal sequence conservation. The resulting model was tested and refined using statistical methods based on negative, as well as positive examples, to improve discrimination. Final model performance was compared with previously established prediction methods using a consistent set of quantitative statistical metrics. The model was then applied to both the Arabidopsis proteome and the entire set of available plant sequences from Genbank, to predict new plant proteins and functional families that contain myristoylation sites. The prediction model has been made available at http://plantsp.sdsc.edu/myrist.html.

Results and Discussion

Positive and negative data sets

The 80 plant sequences chosen as an initial positive training set are shown in Supplementary Table 1 [see Additional file 1]. Each sequence is 25 residues long, including the N-terminal methionine (which eventually will be cleaved in vivo). Although the majority of sequences are derived from Arabidopsis thaliana, the set also contains examples from 21 other plant species.

Plant proteins chosen for the negative test set appear in Supplementary Table 2 [see Additional file 1]. This set contains 185 N-terminal 25-mers, each of which has a glycine at position 2. The reason the set was limited to sequences with a glycine at position two is that all current models of N-myristoylation in plants, animals, and fungi stipulate that this residue is required for activity as a myristoylation substrate. Sequences lacking a glycine at position two would be classified as negative by all algorithms, and therefore uninformative in studies designed to compare performance. The negative examples chosen include nine proteins with N-terminal peptide sequences shown to be inactive as myristoylation substrates by in vitro experiments [17]. The remaining sequences have not been biochemically confirmed as non-myristoylated, but their annotated functions and subcellular locations make myristoylation highly unlikely.

Using the plant sequences from supplementary Tables 1 and 2 [see Additional file 1], scores for the NMT predictor ("NMT") the method of Boisson, Giglione, and Meinnel ("BGM"), and the Expasy Myristoylator ("Expasy") were obtained and plotted as frequency distributions. These distributions are compared to the scores obtained using a Hidden Markov Model (HMM) that was constructed entirely from plant sequences (Fig. 1). The plant-specific model HMM (HMM80) provides greater separation between positives and negatives than either the NMT or BGM method. Although the Expasy neural net provides even wider group separation than HMM80, a substantial number of sequences that should be positive have scores that appear highly negative, indicating a problem with classification accuracy.

Figure 1
figure 1

Distribution of myristoylation prediction scores. Scores for positive plant sequences are indicated by solid lines, and negative sequences by dotted lines. A, NMT algorithm. B, BGM algorithm. C, HMM80, D, Expasy algorithm. Scores have been pooled in histogram bins that are 5 score units wide for the NMT, BGM, and HMM data, and 0.1 score unit wide for the Expasy data.

Selecting a cutoff score for optimum sensitivity and selectivity

Once a scoring procedure has been established, a threshold cutoff value must be chosen in order to classify new examples. Overlap between known positive and known negative examples makes the process of choosing a cutoff difficult, often requiring some statistical compromises. The relative weight given to overall accuracy, versus the cost assigned to false positives and false negatives, can have a large influence on algorithm performance. "Optimal" weighting may vary for different user applications.

Cutoff values chosen for the NMT and BGM algorithm scores by their original authors do not provide maximum discrimination between positive and negative plant examples, as shown in Table 1. To allow for a more fair and consistent comparison between prediction methods, threshold values for NMT and BGM scores were re-evaluated using the Holte 1R algorithm [26], which guarantees the highest possible number of correct classifications in overall discrimination tests. This is the same algorithm that was used to set cutoff values for plant-specific HMMs. By adjusting the cutoff value downward from 0.0 to -2.75, overall accuracy of the NMT algorithm for plant sequences was improved from 87.9% to 95.1%, decreasing both false positives and false negatives. Accuracy of the BGM algorithm was improved from 89.1% to 93.6% when its cutoff value was adjusted upward from 0.0 to 1.85. However, for the BGM algorithm this improvement in overall accuracy could not be achieved without decreased sensitivity, reducing the detection of true positives.

Table 1 Quantitative performance comparisons for myristoylation prediction algorithms. Each model was evaluated using the same test set, consisting of 80 positive and 185 negative plant sequences. For the Expasy algorithm, only 183 negative sequences could be analyzed because two contained "X" characters (denoting ambiguous amino acids), which the program was unable to process. Abbreviations: TP, true positives; FN, false negatives; FP, false positives; TN, true negatives. Names in parentheses after each HMM indicate the sequence weighting scheme used to build the model.

Authors of the Expasy neural net algorithm have suggested using a cutoff value of 0.85 for "high confidence" and a value of 0.49 for "medium confidence" in predicting myristoylation. The Holte 1R algorithm indicates that 0.89 is a slightly more accurate cutoff value for discriminating between the positive and negative plant test sets analyzed here.

Adjusted cutoff values for the NMT BGM, and Expasy methods all gave higher accuracies than the PROSITE predictor, but none of these methods performed as well as HMM80, which gave 100% accuracy and 100% coverage when evaluated using the same test set. The coverage results for HMM80 are not surprising, since this algorithm was trained on the same positive set used for testing, but superior discrimination against false positives could not have been predicted a priori. The three variations on HMM80 listed in Table 1 were constructed using identical training sequences but differing weighting schemes, to determine whether decreasing biased sequence representation might change the outcome. The results indicate that weighting method was unimportant in this context.

Over-representation of some sequence families (e.g. calcium dependent protein kinases) relative to others in the sequence sets used for algorithm testing could potentially lead to biased measurements of model performance. To eliminate this possibility, test sets were filtered to remove redundant sequences, then used to re-evaluate algorithm accuracy (Fig. 2). Positive and negative sets were filtered as a single unit, so that no pair of sequences with greater than the indicated level of similarity remained. The results of these experiments indicate that accuracy measurements for the Prosite, NMT, BGM, Expasy, and plant HMM models are stable and consistent as the original test set is progressively pruned to remove redundancy. Even at test sequence similarity levels of 40% or less, relative performance of the algorithms remains unchanged, with the HMM80 model clearly superior to the others.

Figure 2
figure 2

Effect of test set pruning on algorithm performance measurements. Accuracy scores (number correctly classified/total number of examples tested) were determined for the following algorithms: HMM80 (closed circles), NMT (open squares), BGM (closed triangles) Expasy (closed squares), and Prosite (open circles). Test set similarity refers to the maximum number of amino acid matches permitted between any two sequences in the set. Number of sequences tested were a) 80 positive, 185 negative (original, unfiltered test set, 96% maximum similarity), b) 63 positive, 128 negative (80% maximum similarity), c) 55 positive, 102 negative (60% maximum similarity), and d) 44 positive, 94 negative (40% maximum similarity).

ROC analysis [27, 28], an independent measurement of positive/negative discrimination, was applied to the NMT, BGM, Expasy, and plant-specific HMM prediction algorithms, with similar results (Fig. 3). HMM80 had a total area under the curve (ROC value) of 0.969, compared to 0.892 for NMT 0.874 for BGM, and 0.811 for Expasy. The higher ROC value observed for HMM80 confirms its superior ability to discriminate between positive and negative plant examples. Based on the set of test sequences used in this analysis, the probability that a negative sequence might receive a higher score than a positive one is greater than 10% for the NMT, BGM and Expasy prediction methods, but only about 3% for HMM80.

Figure 3
figure 3

Receiver operating characteristic (ROC) analysis. Receiver Operating Characteristic (ROC) curves were determined using a plant-specific test set of 80 positive and 185 negative sequences. Prediction models tested are distinguished by line type: HMM80, solid black; NMT, solid grey; BGM, dotted black, Expasy open circles. Areas under the curves, an indication of model performance, were: HMM80, 0.969; NMT, 0.892; BGM, 0.874, Expasy 0.811. Higher numbers indicate better performance.

Maximizing algorithm robustness

Although the initial HMM constructed from plant specific sequences performed well in test set discrimination, jack-knife testing by leave-one-out cross validation suggested that this model might have difficulty generalizing to a broader data set. In order to improve robustness and reduce the possibility of over-fitting, a second generation of HMMs were constructed by adding extra training sequences to the original set, as a form of bootstrapping. These sequences, shown in supplementary Table 3 [see Additional file 1], were drawn from 348 previously unclassified plant examples, all of which gave scores greater than 2.05 when tested with HMM80. Several different subsets of the supplementary sequences were used for algorithm construction, progressively increasing cutoff threshold values to reduce the likelihood of introducing false positives.

In selecting a final, "best" HMM model, several different criteria were considered, as shown in Table 2. HMMs broadened by adding sequences with a cutoff score that was too lenient (e.g. 2.0) caused a significant decline in predictive accuracy. Choosing a very restrictive cutoff score for additional sequences (e.g. 12.5) reduced the total number of sequences available, and gave lower robustness in the jack-knife test. HMM266, the model finally chosen for best ability to generalize, with minimum sacrifice of sensitivity, was trained using 186 bootstrapped sequences with a cutoff value of 5.8 combined with the original 80 sequences used in HMM80.

Table 2 Selection of bootstrap cutoff values for plant-specific HMM training sets. Each row shows the results of building a new HMM using 80 initial training sequences plus the number of supplementary bootstrap sequences shown. "Inclusion threshold" indicates minimum score of the bootstrap sequences using HMM80. Accuracy and coverage were determined using the same positive and negative test set for each HMM. Jack-knife (leave-one-out) testing for each HMM was performed against the same training set used in model construction.

The cutoff value for HMM266 was further analyzed using a P-value statistic, to estimate the probability that a random plant sequence might score higher than this threshold. The results of this calculation, (Fig. 4), give a P-value (log probability) of 1.46 at a score of 0.55, suggesting the probability of obtaining a false positive purely due to random sampling would be about 0.0341.

Figure 4
figure 4

P-value determination for plant-specific HMM scores. The highest scoring match was determined for each predicted protein in the Arabidopsis thaliana genome. P-values (log probabilities) were calculated based on score frequencies by plotting the logarithm of observed probability (expressed as rank divided by number of sequences) against HMM266score.

Cross validation

HMM266 was cross validated by building a new model (HMM186B ) using its 186 bootstrap component sequences for training and reserving the original 80 positive sequences (from HMM80) for testing. These results are shown in the first row of Table 3. To eliminate the possibility of artifacts due to sequence selection bias, additional cross-validations were performed by pooling the bootstrap, positive, and negative sets, and filtering to remove redundant sequences at sequence match levels of 80%, 60%, or 40% using the CD-hit program [29]. The filtered sequence pools were then re-divided into non-overlapping training, positive, and negative test sets, based on original sequence derivation, and used to build additional HMMs. These additional cross-validation models are identified by a subscripted "B" in their model names (Table 3). The results indicate that models built on bootstrapped sequences alone can consistently detect true positives and reject false positives with accuracies better than 96%, even when test and training sequences have been pruned to share no more than 40% amino acid similarity.

Table 3 Effect of sequence redundancy on algorithm cross-validation performance. HMM models were constructed using the 186 bootstrap sequences used to train HMM266, then tested for accuracy and coverage against non-overlapping positive and negative test sets. "Max. sequence similarity" refers to the maximum number of amino acid position matches allowed for the sequences in a given row, either within or between test and training sets. Jack-knife (leave-one-out) testing for each row was performed against the training set described in that row.

Minimizing algorithm ambiguity

One criticism of previously available methods for predicting plant myristoylation has been that a large number of database sequences fall in a "twilight" zone, where their scores are intermediate between positive and negative, making them difficult to classify. Table 4 shows an experiment comparing previously available algorithms with HMM266 for ambiguity in classifying unknown sequences. Testing all 7230 currently available plant sequences from Genbank that have a glycine at position 2, only 73 sequences (1%) had potentially ambiguous scores when tested with HMM266. In contrast, 1684 sequences (23.3%) were ambiguous with the NMT algorithm, and 1564 (21.6%) with the BGM method.

Table 4 Ambiguity resolution for plant sequences with N-terminal sequence "MG". Each prediction model was used to score the same 7230 unclassified plant sequences containing a glycine at position 2. Highest negative and lowest positive scores were determined using the 265 plant-specific classified examples described in Table 1.

Predicted myristoylation sites

When HMM266 was used to analyze the amino terminal residues of 257,027 plant sequences from Genbank, 319 sequences from Arabidopsis and 268 sequences from other plant species were identified as potential myristoylation substrates (supplementary Tables 4 and 5 [see Additional file 1]). The 319 Arabidopsis sites represent 1.1% of the total proteome, a number somewhat higher than previously predicted by the NMT algorithm (198), but lower than the BGM method (437).

The distribution of functional families containing predicted N-terminal myristoylation sites in Arabidopsis thaliana is summarized in Table 5. Although more than a third of the sequences are found in proteins whose function has not yet been determined, a very high proportion of the rest are closely associated with signal transduction. Seventy-six of the sequences are protein kinases, including 28 of 34 known calcium-dependent protein kinases (CPK) and all eight known CPK-related protein kinases (CRK). The most common groups among the rest of the protein kinases were APK1 related, cyclin dependent, and cdc2 related. Also included was an Arabidopsis homolog of the Pto gene product, a serine/threonine protein kinase that confers disease resistance in tomato [30].

Table 5 Functional families of proteins predicted to be myristoylated in Arabidopsis thaliana. HMM266 scores were determined for the N-terminal 25 residues for all predicted proteins in the Arabidopsis thaliana genome. Proteins with scores above the threshold cutoff value for positive classification (0.55) were grouped according to annotated protein function.

Twelve Arabidopsis proteins predicted to be myristoylated are protein phosphatase catalytic subunits, including 11 of the PP2C class. Of the non-PP2C phosphatases, one is fructose-2,6-bisphosphatase, a key enzyme in regulation of glycolysis. Homologs of this enzyme in spinach (Spinacia oleracea) and mangrove (Bruguiera gymnorrhiza) were also predicted to be myristoylated. The other Arabidopsis phosphatase is of type PTEN, a dual specificity enzyme that can act on either the lipid phosphatidylinositol (3,4,5)-triphosphate, or phosphotyrosine residues in proteins. In Arabidopsis, the expression of this enzyme is pollen-specific and essential for pollen development [31].

Fifteen of the Arabidopsis proteins containing predicted N-terminal myristoylation sites are GTP binding proteins, including 13 ADP-ribosylases, one Rab-type GTPase involved in endosomal regulation, and a G protein alpha subunit. This result is consistent with the well known myristoylation of ADP ribosylation factors and G-protein alpha subunits in animal species.

Ten calcium binding proteins were predicted to be myristoylated in Arabidopsis, including five members of the copine family and five calcineurin B-like proteins, at least one of which (SOS3) has been shown to activate an SnRK family kinase [11]. Additional calcium-binding proteins with predicted N-terminal myristoylation sites were identified in other plant species. StubGAL83, a calcium binding protein from potato, interacts with SNF1, another SnRK family kinase, which controls expression of glucose-repressible genes and regulates histone kinase activity in yeast [32, 33]. PGPS/D3, a calcium binding protein required for pollen germination in Petunia, is similar to the myristoylated mammalian protein neuromodulin [34]. Pectate lyase, a calcium-binding enzyme essential for cell wall elongation and fruit ripening, was predicted to contain an N-terminal myristoylation site in sequences in both Arabidopsis and rice. Several homologs of the DEM (defective embryo and meristems) gene product of tomato were also observed.

Disease resistance pathways also contain a high number proteins with N-terminal myristoylation sites. In addition to the copine family members [35], and the Pto kinase, predicted myristoylation positives in Arabidopsis included 18 disease resistance proteins of the NBS-LRR type (nucleotide binding site-leucine rich repeat). The six oxidoreductases from Arabidopsis, including four thioredoxins and two glutathione peroxidases, could also represent proteins involved in disease resistance responses, via production of an oxidative burst [36]. Thioredoxins with predicted N-terminal myristoylation sites seem to be highly conserved, and were also found in 10 other plant species (Oryza sativa, Pisum sativum, Zea Mays, Leymus chinensis, Hordeum bulbosum, Hordeum vulgare, Phalaris coerulescens, Ipomoea batatas, Triticum aestivum and Brassica rapa). An Arabidopsis cytochrome P450 related protein identified as myristoylated could be involved de-activating the products of this pathway via oxidative degradation, for example through fatty acid hydroperoxide lyase activity [37].

Several Arabidopsis transcription factors were identified as having N-terminal myristoylation sites, including four from the basic leucine zipper family, three from the myb family and, one of the WRKY type. Although these proteins may ultimately need nuclear localization to fulfill their functional roles, they could reside temporarily in cytosolic locations that require myristoylation. Light dependent changes in DNA binding activity of several plant bZIP transcription factors have been shown to involve both phosphorylation and subcellular translocation from cytoplasm to nucleus [38].

A number of Arabidopsis proteins in the ubiquitin-dependent protein degradation pathway also had predicted myristoylation scores slightly above threshold cutoff values. These sequences included six F-box proteins, two ubiquitin proteases, and 26S proteasome regulatory subunit IV. The 26S proteasome regulatory subunit IV was also identified as myristoylated in the moss Tortula ruralis. These results are consistent with recent biochemical evidence verifying N-terminal myristoylation in the Rpt2 subunit of the 26S proteasome of yeast [39].

It is possible that not all predictions of the current algorithm are correct. Three Arabidopsis proteins with domains suggesting exclusively nuclear functions were also predicted to be myristoylated: a DEAD/DEAH box helicase, a putative Dhp1 exoribonuclease, and DNA mismatch repair protein MSH3. These examples may represent false positive predictions, consistent with the error rate of around 3 per 100 sequences predicted by both the ROC curve and P-value statistics for HMM266.

Cryptic internal myristoylation sites

In addition to the 319 N-terminal sequences, HMM266 found 301 potential myristoylation sites in the Arabidopsis proteome beginning at internal rather than amino terminal positions, which are listed in supplementary Table 6 [see Additional file 1]. The internal sites would not normally be myristoylated in vivo unless post-translational cleavage occurs, unmasking a new N-terminal glycine. However, some predicted internal myristoylation sites could be indicative of gene prediction errors, for example choosing the wrong methionine residue as a start site, or fusing two distinct proteins into a single predicted gene product.

The distribution of all predicted plant myristoylation sites for Arabidopsis thaliana is plotted according to score and start position in Fig. 5. After peaking at position 1, both numbers and scores of predicted positives generally decline with increasing distance from the amino terminus. The reliability of the internal site predictions remains somewhat uncertain; a third of the proteins identified are either hypothetical or of unknown function, and neither protease cleavage sites nor database errors can currently be verified without additional experimental data.

Figure 5
figure 5

Sequence positions of predicted myristoylation sites in Arabidopsis thaliana proteins. HMM266 scores were determined for all predicted proteins in the Arabidopsis thaliana genome. All scores above the threshold cutoff value for positive classification (0.55) are shown plotted against start position of the matching pattern.

The total number of potential internal myristoylation sites in the Arabidopsis proteome predicted by HMM266 is more than 50 fold lower than the number predicted by the PROSITE myristoylation signature (162,183). It was not feasible to determine the total number of internal Arabidopsis sites for the NMT algorithm, because the NMT Predictor website accepts only one sequence at a time for analysis. Information on the total number of internal sites was also unavailable for the BGM model, which requires NMT profile scores as a prerequisite to making final calculations.

Comparison of plant and animal substrate specificity

In building a substrate specificity model for evolutionarily conserved enzymes like N-myristoyltransferase, a choice must be made as to the breadth or narrowness of the species range used for training sequences. Limiting taxonomic breadth is likely to improve algorithm performance on closely related organisms, but may make the resulting model less suitable for more distant species, as shown in Fig. 6. The NMT BGM, and Expasy algorithms are built on the same underlying amino acid profile, heavily weighted towards animal examples. This profile gives scores that are consistently lower for plants than animals. Conversely HMM266, built exclusively from plant examples, gives lower scores for animal protein sequences than for plants. This suggests that plant and animal N-myristoyl transferases do, in fact, have differing target specificities.

Figure 6
figure 6

Comparison of myristoylation prediction scores for sequence sets from plants and animals. Scores for plant sequence sets are indicated by solid lines, and animal sequence sets by dotted lines. Panel A, NMT raw profile scores. Panel B, HMM266 scores.

N-myristoyltransferase substrate specificity for plants and animals can also be compared by observing position specific amino acid frequencies, listed in Table 6. Differences between these frequencies are shown as Kullback-Leibler distances (relative entropies) in Fig. 7. Structural studies have indicated that amino acid residues 2–6 fit within the binding pocket of N-myristoyltransferase, while subsequent positions act as a linker region [21].

Table 6 N-terminal amino acid frequencies Amino acid frequencies were calculated based on 247 non-plant proteins classified as myristoylated by the NMT model [49], and 587 plant proteins predicted to be myristoylated by HMM266. Bold values indicate relative frequencies for myristoylated plant proteins, italic values for myristoylated animal proteins.
Figure 7
figure 7

Kullback-Leibler distance (relative entropy) between N-terminal amino acid sequences of myristoylated plant and animal proteins. Amino acid frequencies used to calculate relative entropies were obtained from 247 non-plant proteins classified as myristoylated by the NMT model [49], and 587 plant proteins predicted to be myristoylated by HMM266. Position 1 in this figure refers to the N-terminal methionine (for consistency with sequence database numbering), even though this residue must be removed in vivo for myristoylation to occur.

The greatest differences between the amino acid distributions for plant and animal myristoylation substrates occur at positions four and five, where many of the plant sequences, but few animal proteins contain a cysteine residue. The presence of a cysteine group near the N-terminus in myristoylated proteins has been shown to be associated with subsequent palmitoylation, a post-translational modification that increases the stability of membrane association [3]. Approximately half of the plant proteins identified as myristoylated contained a cysteine at positions three, four, or five, in both Arabidopsis (171/319) and non-Arabidopsis examples (140/268). These results suggest that either myristoyltransferase specificity differs at these positions, or that the combination of myristoylation with palmitoylation may occur more frequently in plants than animals.

Less significant amino acid frequency differences occur between plants and animals at positions 6, 7, 13, 17, and 19. Serine and threonine are the most common amino acids at position 6 in both plants and animals, but are somewhat less frequent in plants. This is consistent with a role for this position in stabilizing the enzyme-substrate complex via hydrogen bonding, as has been demonstrated in yeast. Perhaps hydrogen bonding in this position is less critical for plant myristoylation sequences, which may use glycine instead. At position 7, basic amino acids predominate for both plants and animals, but plants can use arginine here as well as lysine, suggesting a larger or more flexible enzyme cavity space in plants. This result is consistent with the observation that a cloned N-myristoyltransferase from Arabidopsis thaliana is active against peptide substrates with a much wider range of amino acids at position 7 than the corresponding cloned enzyme from Saccharomyces cerevisiae [17]. Differences between plant and animal sequences at positions 13, 17, and 19 are more complex, with no one single type of amino acid side chain predominating.

Conclusions

The plant specific N-myristoylation prediction method HMM266 has been used to predict 319 myristoylation substrates in the Arabidopsis thaliana proteome, along with 268 additional examples in other plants. The functional families where these proteins occur are highly representative of signal transduction pathways, especially those involving protein kinases, protein phosphatases, small GTP-binding proteins, and calcium binding proteins. Plant specific physiological functions that depend on proteins predicted to be myristoylated include responses to stresses such as wounding salt, drought, and pathogen exposure, as well as developmental events related to pollen tube extension, meristem formation, cell wall extension, and fruit ripening.

HMM266 is more sensitive in detecting known positives and more selective in avoiding false positives than previously available alternatives, including the PROSITE myristoylation motif, the NMT Predictor [20, 21] the BGM modified NMT algorithm [17], and the Expasy Myristoylator [24] The use of several independent statistical methods for algorithm validation, including Receiver Operating Characteristic, P-value, and jack-knife testing, adds confidence to the predictions. The new algorithm gives much wider separation between positive and negative scores than the NMT and BGM methods, allowing more than 20-fold reduction in the number of unclassified sequences giving ambiguous, uninformative results.

Superior performance of HMM266 is due to the selection of a plant-specific training set, covering 266 unique sequence examples from 40 different species. Previously available methods rely strategically on pre-selection by a relatively unrestrictive amino acid profile, followed by subtraction of heuristic adjustment factors to remove false positives. Because adjustment heuristics are based on a relatively small number of negative examples, they may be unable to generalize, and insufficiently restrictive, causing overestimation of negative scores. In addition to overestimating negative scores, the NMT and BGM prediction methods tend to underestimate scores for positive plant sequences because the underlying training set used to determine the profile is highly biased against plants.

The use of a probability based HMM to obtain predictive scores has effectively extracted the most relevant amino acid structural information for each individual position from the statistical relatedness of the training set, making additional heuristic adjustments unnecessary to achieve classification accuracy. As new biochemical information becomes available, the machine learning approach to model building used here should be easier to update than a heuristic approach. The same set of objective, quantitative validation tests used in the current study can then be readily applied to the evaluation of future models.

Methods

Selection of plant-specific positive and negative sequence sets

Myristoylation positive sequences were initially obtained by searching the scientific literature for plant proteins with N-terminal myristoylation documented by biochemical experiments. This group was supplemented with N-terminal sequences that matched peptides identified in vitro as active myristoylation substrates [17, 18]. The positive set was further supplemented with proteins whose N-terminal amino acids sequences matched the seven initial amino acids of biochemically verified examples (from either plants or animals), having the same subcellular location and biochemical activity. All positive sequences were truncated at a length of 25 amino acids, to reduce computational complexity.

For model testing, a complete set of predicted proteins for Arabidopsis thaliana was obtained from the 4/17/2003 release of the TIGR Arabidopsis thaliana Genome Annotation Database [40]. A set of myristoylated non-plant sequences was also assembled, in order to assess whether the prediction algorithms showed any taxonomic bias. This set was obtained by selecting all 247 unique, non-plant examples from the list of higher eukaryote training sequences used by Maurer-Stroh et al [21], then extending sequence length (originally 17 amino acids) to 25 residues, based on full sequence data from the SwissProt database.

Negative sequences, used to assess algorithm specificity, were chosen based on an examination of sequence annotation suggesting that myristoylation was highly unlikely. Only sequences with a glycine residue at position two were selected. These functional negative candidates were initially selected using the annotation keywords DNA polymerase, helicase, ribonucleoprotein, polymerase, ribosomal, and histone. The set was then manually curated to remove any potentially ambiguous candidates, for example transcription factors containing a basic leucine zipper motif (previously shown to contain a peptide sequence with high activity as a myristoylation substrate activity), as well as proteins related to c-Myb, known to be highly acetylated at leucine residues [41]. The final negative set contained 185 sequences (Supplementary Table 2 [see Additional file 1]).

Sequence filtering

For some experiments, sequence sets were filtered to remove redundancy using the clustering program CD-hit [29, 42]. This program creates clusters using a greedy algorithm, first selecting seed sequences by length, then finding sequences with identities to a particular seed at greater than or equal to a specified threshold value (for example 60%). The program then constructs a filtered output set containing one sequence from each cluster.

Generation of models

Profile Hidden Markov Models (HMMs) were generated using the hmmbuild function of software package HMMER 2.3.1 [43, 44] from input sequences 25 amino acids in length, aligned without gaps. Sequence weighting was performed using both the program's default algorithm [45] and two alternate methods, Henikoff [46] and Voronoi [47]. Test sequences were evaluated for model fit on the basis of bit score values generated by the hmmpfam function of the HMMER software.

In some experiments, broader, more generalized HMMs were constructed by supplementing the original set of plant-specific positives with high scoring hits from a first round of HMM testing against unclassified plant sequences from Genbank. Second generation HMMs were built using training sets supplemented with between 29 and 348 additional sequences, bootstrapped from initial results by including all sequences above a specified threshold.

Models were tested for sensitivity and specificity by evaluation against classified positive and negative sequences, as described below. The robustness of each HMM was tested by leave one out cross-validation (jack-knife test), where a new HMM was generated by leaving out each individual sequence of the training set in turn. Each new HMM generated was then tested for the ability to detect all of the sequences used in model construction.

HMMs were also evaluated using a P-value statistic. Because the bit score used for classification is based on the maximum match value for each protein sequence, the scores obtained should follow an extreme value distribution. In such a distribution, P(score > x) ~ Ce-λx+c, where C, c, and λ are constants. For randomly selected protein sequences, the logarithm of observed probability (expressed as rank divided by number of sequences) plotted against score should be linear. The point at which this plot deviates from linearity can be used as a threshold separating true positives from random scores [48]. To obtain P-values corresponding to HMM values, the highest scoring match was determined for each predicted protein in the Arabidopsis thaliana genome, then plotted against the logarithm of observed probability (expressed as rank divided by number of sequences). A line was fitted to the central part of the curve to obtain expected P-values.

Comparative evaluation of prediction algorithms

Positive, negative, and unclassified plant sequences were scored for predicted myristoylation sites using Prosite pattern PS00008 [19], the higher eukaryote settings for the "NMT Myr Predictor" website [49], the modified predictor pattern of Boisson, Giglione, and Meinnel [17], and the Expasy Myristoylator website [25] as described by the authors. The PROSITE profile gives a simple positive/negative output, but the NMT Predictor (NMT), Boisson, Giglione, and Meinnel (BGM), and Expasy Myristoylator (Expasy) methods produce numerical scores. Cutoff values for distinguishing positives from negatives were optimized for each quantitative prediction algorithm using the 1R classification method of Holte [26], as implemented in the WEKA open source software package, version 3.2.3 [50]. Briefly, this method seeks to maximize accuracy using a classified set of positive and negative examples, which are sorted in numerical order, then discretized into bins, each bin containing no fewer than a specified minimum number of instances with the same classification. The cutoff value is chosen to provide the highest possible accuracy without splitting the minimum bin.

The 1R method was applied to each algorithm using the sequence sets described above (80 positives plus 185 negatives), with bin size set to the smallest possible value between 2 and 10 allowing separation into exactly two classes. Overall accuracy, coverage, false positive, and false negative rates were tabulated first using all 265 test sequences, then re-calculated using 10-fold stratified cross-validation to confirm that results were not significantly different within any subset of the data.

Algorithm accuracy was also evaluated using Receiver Operating Characteristic (ROC) analysis, a threshold independent test for sensitivity and selectivity widely used in clinical medicine [27, 28]. In this test, classified positive and negative examples are ranked by score in decreasing order, and used to construct a plot of sensitivity (fraction of positives) on the y axis, versus 1 – specificity (fraction of negatives) on the x-axis. The area under the ROC plot is related to the rank-sum test for two independent samples (Mann-Whitney or Wilcoxon test). This area is calculated to determine the probability that a randomly selected true positive case might receive a higher score than a randomly selected true negative. Higher scores indicate greater reliability.

Algorithm ambiguity was tested by obtaining scores for 7230 unclassified plant sequences from Genbank, each containing glycine at position two. The highest score in the negative sequence set and the lowest score in the positive sequence set were used as upper and lower bounds. The percentage of scores falling between these limits were calculated as a measure of potential ambiguity for each algorithm, since these examples could prove difficult to classify definitively.

Amino acid frequencies were calculated for each of the first 25 N-terminal positions, based on the set of 247 non-plant training examples used by Maurer-Stroh et al [21], and on all 587 plant examples predicted to be myristoylated by the final plant-specific HMM. To measure distance between the two data sets, frequencies were used to calculate relative entropy according to the following formula,

where H is entropy, P is the amino acid distribution for plant sequences, and Q the amino acid distribution for animal sequences [51]. Division by zero was avoided by the addition of one prior count to each frequency numerator and denominator before entropy calculation.