Background

The methylation of proteins is of increasing biological interest. It is predominantly found on lysine and arginine residues, but has also been found on histidine, glutamic acid and on the carboxyl groups of proteins (reviewed in Grillo and Colombatto 2005) [1]. Methylation of lysine involves the addition of one to three methyl groups on the amino acid's ε-amine group, to form mono-, di- or tri-methyllysine. Its function is best understood in histones. Methylation on the tails of histone proteins, in conjunction with acetylation and phosphorylation, controls their interaction with other proteins, and affects chromatin compaction and the up- or down-regulation of gene expression [2]. In S. cerevisiae, lysine methylation is found in histone H3 and histone H4 [3]. Tri-methylation at H3K4 and H3K36 is positively correlated with gene activity [4], while methylation at H3K79 is involved in gene silencing [5, 6]. Histone H3K79 methylation is evolutionarily conserved and is involved in several pathways, including Sir protein-mediated heterochromatic gene silencing [7], meiotic checkpoint control [8] and the G1 and S phase DNA damage checkpoint functions of Rad9p [9, 10]. While studies of lysine methylation have mainly focused on histone proteins, several non-histone proteins are also known to be lysine-methylated. They are mainly ribosomal proteins or proteins involved in protein translation [11], and include Rpl12p [12, 13], Rpl23p [12, 14], Rpl42p [15], and eEF1Ap [16].

The methylation of arginine involves the addition of one or two methyl groups to the amino acid's guanidino group, forming mono- or di-methylarginine. It is predominantly known to be associated with RNA regulation and processing [17]. In S. cerevisiae, Hmt1p is a type 1 arginine methyltransferase that catalyses the formation of mono- and asymmetric di-methylarginine. This enzyme is known to methylate a number of proteins that contain an RGG-motif; these include Npl3p, Hrp1p, Nab2p, Gar1p, Nop1p, Nsr1p, Yra1p, Sbp1p, and Hrb1p. These proteins have been implicated in poly(A)+ mRNA binding, processing and export [17], ribosome biogenesis [18–20] and gene silencing [21]. Moreover, methylation is required for the nuclear export of the RNA binding proteins Npl3p, Hrp1p, and Nab2p [22, 23]. The repeated RGG-motif is known as an RNA-binding motif [24], and this also supports the role of arginine methylation in the regulation of mRNA binding [25]. The methylation of nuclear shuttling proteins is suggested to weaken their binding with cargo proteins and disrupt their export from the nucleus [26]. Arginine methylation is also known to facilitate or block protein-protein interactions. Arginine methylation of SmB protein facilitates the binding of tudor domains in SMN, SPF30, and TDRD3 proteins [27]. In contrast, arginine methylation of Sam68 blocks the interaction of a nearby proline-rich motif with an SH3 domain, but not with a WW domain [28]. More examples of methylarginine-regulated interactions are reviewed in McBride and Silver (2001) [29] and Bedford and Clarke (2009) [30].

There have been several studies to identify arginine- or lysine-methylated proteins on a proteome-wide scale. In the first of these studies, arginine-methylated protein complexes were purified from HeLa cell extracts using anti-methylarginine antibodies specific against RG-rich sequences [31]. This resulted in the identification of over 200 arginine-methylated proteins, involved in pre-mRNA processing, protein translation, and DNA transcription. However, the actual methylation sites on these proteins remained unknown [32]. The second study utilised stable isotope labelling by amino acids in cell culture (SILAC), in which [13CD3]methionine was converted to [13CD3]S-adenosyl methionine, the substrate for arginine and lysine methylation [32]. Advantages of this method included increased confidence of identification, a capacity to distinguish between trimethylation and acetylation, which are near-isobaric, and the ability to quantify the relative changes in methylation status of a protein between two samples. In combination with anti-methyllysine and anti-methylarginine antibody immunoprecipitation techniques, Ong et al. (2004) [32] were able to identify methylation on histones from HeLa cell extracts, such as on histone H3K27. Around 30 other proteins were also found to be methylated at RG-rich motifs, and most of these proteins are RNA binding or associated with mRNA processing pathways. The third study used anti-methyllysine antibodies to search for organ-specific lysine methylation in Mus musculus [33]. Proteomic analysis of brain tissue extract by 2-D PAGE, western blotting, and MALDI-ToF peptide mass fingerprinting identified the following lysine-methylated proteins: neurofilament triplet-I protein, Hsc70 protein, creatine kinase, α-tubulin, α-actin, β-actin, and γ-actin. Furthermore, α-actin and creatine kinase were found to be methylated in muscle tissue.

The use of tandem mass spectrometry to discover new protein post-translational modifications (PTMs) is common [34]. However, peptide mass fingerprinting can also be used to search for new PTM sites [35]. The FindMod program [35] caters for this approach. It requires peptide mass spectra from a mostly pure protein, for example a spot from a 2-D gel, and examines experimental peptide masses for mass differences from the theoretical peptides of that protein that correspond to a post-translational modification. Peptides that are potentially modified are checked to see if they contain amino acids that can carry the modification. Where very high accuracy peptide mass measurements can be made, for example with new instruments like the prOTOF2000, high confidence predictions are possible. Parent-ion masses from tandem mass spectrometry data can also be used in FindMod, where it may serve as an initial screen for PTMs before employing more sophisticated and computationally expensive methods [36, 37].
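The mass-difference matching described above can be illustrated with a minimal Python sketch. This is not the FindMod implementation: the function name, the input format and the example masses are hypothetical, and only the monoisotopic mass shifts for methylation (CH2, +14.01565 Da per methyl group) are standard values.

```python
# Monoisotopic mass shifts (Da) for the two modifications searched in this study.
METHYL_SHIFTS = {"monomethyl": 14.01565, "dimethyl": 28.03130}

def match_methylation(observed_mass, theoretical_peptides, tolerance=0.04):
    """Return (sequence, modification) pairs consistent with an observed mass.

    theoretical_peptides: iterable of (sequence, unmodified_mass) tuples
    (hypothetical input format). A peptide is only considered if it contains
    a residue that can carry the modification (K or R).
    """
    matches = []
    for sequence, unmodified_mass in theoretical_peptides:
        if "K" not in sequence and "R" not in sequence:
            continue  # no lysine or arginine available to carry a methyl group
        for name, shift in METHYL_SHIFTS.items():
            if abs(observed_mass - (unmodified_mass + shift)) <= tolerance:
                matches.append((sequence, name))
    return matches

# Hypothetical peptides and masses, for illustration only.
peptides = [("AAKGGR", 1000.00), ("AAGGL", 986.00)]
candidates = match_methylation(1014.02, peptides)
```

In this toy example the second peptide would fit a dimethylation by mass alone, but it is rejected because it has no lysine or arginine; this mirrors the residue check described in the text.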

Here we describe a strategy for the discovery of methylation on a global scale, using peptide mass fingerprinting data, and implement this to search for methylated lysine and arginine residues in the yeast proteome. A proteome-scale set of MALDI-ToF mass spectra [38] was analysed for putative methylated peptides. The application of 5 filters yielded high-confidence methylation sites that were then further investigated to understand where they are found in protein sequences and their likely function.

Results

Large-scale methylation discovery in yeast peptide mass spectra

FindMod was used to analyse peptide mass spectra for 2,607 yeast proteins out of a total ~6,500 (representing 40% of the total proteome) for the presence of mono- and di-methylation. A tailor-made mass tolerance was calculated for each spectrum to reduce spurious peptide matches; the average of this for all spectra was ± 0.04 Da. Of all the 24,105 FindMod queries, there were 17,471 matches to potentially methylated peptides (Figure 1). Five filtering strategies, used sequentially, were then applied to this set to find methylation sites of very high confidence. The first filter removed peptide masses that also matched unmodified peptide sequences, as these masses are likely to arise from unmodified peptides. Conversely, peptide masses that did not match unmodified peptide sequences are likely to be modified, and these were analysed with the second filter. The second filter removed any peptides that contained D or E residues, as artifactual methylation may result from partial methyl esterification of D or E residues [39]. The third filter was designed to take advantage of redundancy within each FindMod output, by removing one-off or spurious mass spectra. It searched for modifications that were found in two or more overlapping peptides (Figure 2), and took advantage of the reduced efficiency of tryptic cleavage at methylated residues [32], where overlapping peptides with missed cleavages were likely to be found. The fourth filter reduced FindMod false positives by considering whether modifications found by FindMod were unambiguous or ambiguous. An unambiguous modification had only one FindMod match against one query peptide mass, whereas an ambiguous modification had more than one match against a query mass (Additional file 1). For a peptide to be included in the final set of methylated peptides, at least one peptide in the overlapping peptides had to be an unambiguous peptide match. The use of these 4 filters resulted in 169 high confidence methylated peptides, from 17,471 initial low confidence matches (Figure 1).
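The first four filters can be sketched as follows. This is an illustrative reading of the text rather than the pipeline actually used: the `Match` record, its field names and the example data are hypothetical, and "overlapping" is approximated as any overlap of sequence coordinates between two peptides carrying the same modification.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    query_mass: float   # experimental peptide mass submitted to the search
    start: int          # first residue of the matched peptide in the protein
    end: int            # last residue of the matched peptide in the protein
    sequence: str
    modification: str   # e.g. "monomethyl" or "dimethyl"

def apply_filters(matches, unmodified_masses, tolerance=0.04):
    # Filter 1: drop masses that can also be explained by an unmodified peptide.
    kept = [m for m in matches
            if not any(abs(m.query_mass - u) <= tolerance for u in unmodified_masses)]
    # Filter 2: drop peptides containing D or E (possible methyl esterification).
    kept = [m for m in kept if not set(m.sequence) & {"D", "E"}]
    # A query mass is unambiguous if it produced exactly one match overall.
    per_query = Counter(m.query_mass for m in matches)
    result = []
    for m in kept:
        group = [m] + [o for o in kept
                       if o != m and o.modification == m.modification
                       and o.start <= m.end and m.start <= o.end]
        # Filter 3: two or more overlapping peptides; filter 4: one unambiguous.
        if len(group) >= 2 and any(per_query[g.query_mass] == 1 for g in group):
            result.append(m)
    return result

# Hypothetical matches for one protein.
m1 = Match(1014.02, 10, 18, "AAKGGLLSR", "monomethyl")
m2 = Match(1500.80, 10, 24, "AAKGGLLSRVVNNPK", "monomethyl")  # one missed cleavage
m3 = Match(1200.60, 40, 48, "GGDLLKAAR", "monomethyl")        # contains D
m4 = Match(900.50, 60, 67, "LLKAAGGR", "dimethyl")            # no overlapping partner
m5 = Match(1100.55, 80, 88, "AAKLLGGVR", "monomethyl")        # also fits unmodified mass
high_confidence = apply_filters([m1, m2, m3, m4, m5], unmodified_masses=[1100.56])
```

Only the pair `m1` and `m2` survives, which corresponds to scenario a) of Figure 2: a peptide with no missed cleavage overlapping a peptide with one missed cleavage.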

Figure 1

Finding high confidence modified peptides using 5 filtering strategies. By removing peptides that matched to both methylated peptides and unmodified peptides (filter 1) and peptides that contain D or E residues (filter 2), 2,641 peptides were left. By filtering for overlapping peptides (filter 3), and requiring that at least one of the overlapping peptides is an unambiguous peptide match (filter 4), 163 peptides remained. These included 108 unambiguous peptide matches, 43 ambiguous peptide matches, and 12 peptides in both categories. Found in these peptides are 45 lysine methylation sites and 38 arginine methylation sites, all of which are of high confidence (filter 5).

Figure 2

Filter 3: using two or more overlapping peptides to improve modification confidence. A peptide can have no missed cleavage or one missed cleavage (MC) at either the N-terminus or C-terminus of the peptide. The modification site has to be found in two or more overlapping peptides for it to be accepted for subsequent analyses. There are a few scenarios in which these modification sites can be found. a) There can be one peptide with no missed cleavage overlapping with another peptide with one missed cleavage. b) The modification site can be found in two overlapping peptides, each with one missed cleavage. c) The modification site may be found in three peptides: one peptide with no missed cleavage, and two peptides with one missed cleavage each. In all of the above cases, at least one peptide in the overlapping peptides has to be an unambiguous peptide match. Cases where this was not seen were excluded from all subsequent analyses.

While overlapping peptides helped localise methylation sites to one or more peptides, they did not necessarily localise the methylation to one amino acid. To address this, we used a fifth filter. When two or more modified peptides that passed filters 1-4 were also found to overlap and share the same modification site, the modification was classified as high confidence and kept. Note that any results for lysine trimethylation were discarded from the study since it is near-isobaric to lysine acetylation. From this filtering process, we found 40 lysine-methylated proteins with 45 lysine methylation sites: 25 with mono- and 20 with di-methylation. Similarly, we found 31 arginine-methylated proteins with 38 arginine methylation sites: 20 with mono- and 18 with di-methylation. There were 5 proteins that contained both arginine and lysine methylation. The list of high confidence methylated proteins and methylation sites is shown in Table 1; additional information on these high confidence methylated peptides and methylation sites is shown in Additional file 2 and Additional file 3, respectively.
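The reason for discarding trimethylation can be checked with a short calculation. The sketch below uses standard monoisotopic masses; the variable names are ours, and the 0.04 Da figure is the average mass tolerance reported in this study.

```python
# Standard monoisotopic atomic masses (Da).
C, H, O = 12.0, 1.00782503, 15.99491462

methyl_shift = C + 2 * H            # net addition of CH2 per methyl group
trimethyl_shift = 3 * methyl_shift  # three methyl groups on one lysine
acetyl_shift = 2 * C + 2 * H + O    # net addition of C2H2O for acetylation

# Mass difference between trimethylation and acetylation.
difference = trimethyl_shift - acetyl_shift

# With an average tolerance of 0.04 Da, the two modifications cannot be
# separated by peptide mass alone.
AVERAGE_TOLERANCE = 0.04
distinguishable = difference > AVERAGE_TOLERANCE
```

The two shifts differ by roughly 0.036 Da, which is below the average tolerance, so a peptide mass matching trimethyllysine could equally be explained by acetyllysine.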

Table 1 List of methylated proteins and methylation sites discovered by FindMod.

Confirmation of FindMod protein methylation

To establish the accuracy of our methylation discovery approach, we theoretically digested all known methylated proteins in Swiss-Prot and analysed the resulting peptides with our FindMod approach. We supplemented this with a larger set of theoretically methylated proteins. The average true positive rate for FindMod at 0.04 Da was 89%. For methylation sites in Swiss-Prot, FindMod had a true positive rate of 100% for monomethyl-K, 98% for dimethyl-K, and 76% for dimethyl-R (Table 2a). The true positive rate for monomethyl-R could not be accurately estimated since the number of test cases was insufficient for accurate evaluation. Similarly, the true positive rate for the artificial methylation set was 78% for monomethyl-K, 89% for dimethyl-K, and 90% for both monomethyl-R and dimethyl-R (Table 2b). Additional results for the evaluation of the true positive rate of FindMod are shown in Additional file 4.

Table 2 True positive rate at mass tolerance of 0.04 Da.

To further assess the accuracy of the FindMod approach, methylation sites discovered by FindMod were cross-referenced with known methylation sites in the literature and databases. Whilst only a small number of proteins are documented as methylated in the literature, we confirmed 3 proteins (Ssb1p, Ssb2p, Tub2p) as methylated (Table 3). If we included methyl-lysine sites in peptides containing D and E, we also confirmed the methylation of Tef1p and Rpl23p. This included 3 lysine methylation sites (K30, K79, and K390) from Tef1p, and 1 lysine dimethylation site (K110) from Rpl23p [14] (Additional file 5). Furthermore, we found 15 methylated ribosomal proteins in S. cerevisiae, consistent with the presence of methylation sites in ribosomal proteins of eukaryotes, such as S. cerevisiae [12–15, 40], S. pombe [41], A. thaliana [42], and human [43–46].

Table 3 Methylated proteins identified independently by both FindMod and described in the literature.

Discovery rate of methylated peptides, unmodified peptides, and lysine and arginine-methylated residues

The discovery rate of a peptide is the frequency of protein identifications in which a particular peptide is observed. Methylated peptides with low discovery rates are likely to be sub-stoichiometric and partially methylated. It was predicted that there should be many more unmodified peptides than methylated peptides, and that methylated peptides would have a lower discovery rate since they are likely to be sub-stoichiometric. The discovery rate of high confidence methylated peptides was found to be significantly lower than that of unmodified peptides (p < 0.0001). The median discovery rate for unmodified peptides was 0.50, and the median value for arginine- and lysine-methylated peptides was 0.03. To check that the lower discovery rate of methylated residues was not due to differences in peptide ionisation efficiency, we examined whether there was a correlation between the discovery rates of methylated and unmodified residues. In the set of results, there were 69 methylated residues for which the corresponding unmodified residues were also seen. The discovery rate of methylated residues was significantly but weakly correlated with the discovery rate of matching unmodified residues (Kendall's τ = 0.22, p < 0.01), as expected. A list of methylated proteins and the methylation sites discovered by FindMod is shown in Table 1. The discovery rates of all high confidence methylated peptides and methylation sites are shown in Additional file 2 and Additional file 3, respectively.

Biological function, sub-cellular localization, abundance and half-life of methylated proteins

Methylated proteins are known to be involved in several pathways, such as translation [11] and RNA processing [49]. To investigate the function of the methylated proteins from yeast, gene ontology (GO) annotations for all yeast methylated proteins from FindMod analysis and Swiss-Prot were compared to non-methylated yeast proteins (Table 4). It was found that a number of biological processes were enriched with very high statistical significance, specifically translation, ribosome biogenesis and assembly, RNA metabolic process, and organelle organization and biogenesis. The molecular function of structural activity, RNA binding, and translation regulator activity were also significantly enriched. As may be expected from the above, methylated proteins were significantly enriched in the cellular components of the ribosome and cytoplasm.

Table 4 Methylated proteins from yeast are enriched in specific processes, functions and components.

Protein abundance data from Ghaemmaghami et al. (2003) [50] was used to compare the abundance of methylated proteins to non-methylated proteins. This was used to determine if lower abundance proteins, more likely to be involved in signal transduction and regulation [51], are methylated. Methylated proteins were found to have a higher median abundance of 11500, as compared to non-methylated proteins, which had a median abundance of 2220 (p < 0.0001, Figure 3a). Despite this, several methylated proteins of low abundance were seen including 5 proteins of less than 1000 molecules per cell. These included Snf2p (217 copies/cell), Snu114p (300 copies/cell), Mrpl20p (358 copies/cell), and Rpl3p (450 copies/cell). Examples of proteins with high abundance are Rp1Bp (265,000 copies/cell) and Tdh3p (169,000 copies/cell).

Figure 3

Distribution of abundance of methylated proteins and half-life of lysine-methylated proteins, versus non-methylated proteins. a) The x-axis represents protein abundance, in copies per cell, as log base 10. The abundance of methylated proteins is represented with a solid line, while the abundance of non-methylated proteins is represented with a dotted line. b) The x-axis represents protein half-life, minutes, in log base 10. The half-life of lysine-methylated proteins is represented with a solid line, while the half-life of non-methylated proteins is represented with a dotted line. The arrow points to a group of proteins with very short half-life, seen only in 'other' proteins, which are likely to be unmethylated. Note that in both figures, 'other' proteins are those for which methylation was not found; this group may, however, contain some methylated proteins. Note also that abundance data and half-life data was not available for all yeast proteins in Belle et al. (2006) [53] and Ghaemmaghami et al. (2003) [50].

The methylation of lysine residues has been suggested to block their ubiquitination, leading to a longer protein half-life [52]. To investigate this possibility, protein half-life data from Belle et al. (2006) [53] was used to compare the half-life of lysine-methylated proteins to non-methylated proteins. Interestingly, we found methylated proteins had a longer median half-life of 66 minutes, as compared to 43 minutes for non-methylated proteins (p = 0.012, Figure 3b). A striking difference between the methylated and non-methylated proteins was the absence of a group of proteins with very short half-life (see arrow in Figure 3b). Despite this, our approach also identified 18 methylated proteins with half-life less than 60 minutes. Examples of methylated proteins with shorter half-lives are Rrp5p (15 minutes), Ski3p (32 minutes), and Snu114p (52 minutes). Examples of proteins with long half-life are Utp22p (13,266 minutes) and Atp2p (6,627 minutes), although we note that these numbers may be erroneous estimations in the Belle et al. study (2006) [53]. Although the abundance and half-lives of methylated proteins could be analysed more precisely by comparing methylated proteins to other proteins from the same GO slim biological process, this approach was limited by the relatively small number of methylated proteins (66 proteins) in the dataset. Methylated proteins mapped to 33 gene ontology biological process categories, with an average of 2 proteins per category, which was unsuitable for appropriate statistical analyses.

Interplay of methylation and other post-translational modifications

To see if lysine methylation might block ubiquitination, the Ubipred software [54] was used to predict if known methylated lysine sites are also subject to ubiquitination. The Ubipred software has an accuracy of 84.4% and is thus sufficiently reliable for this test. It was found that 43% of high-confidence lysine methylation sites were also predicted to be ubiquitination sites. This result lends support to the hypothesis that methylation might block ubiquitination, potentially prolonging the half-life of lysine-methylated proteins.

It has recently been reported that the methylation of arginine can regulate the phosphorylation (or dephosphorylation) of some proteins [55–61]. To investigate whether there is evidence of interplay between arginine methylation and phosphorylation in S. cerevisiae, we examined the proportion of arginine-methylated proteins that are known to be phosphorylated in databases and in the literature. It was found that 94% (30/32) of arginine-methylated yeast proteins are known to be phosphorylated. This is a considerable increase over the 38% (2,548/6,709) of all S. cerevisiae proteins known to be phosphorylated and suggests a possible interplay of arginine methylation and phosphorylation [55–61].
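The size of this difference can be put in context with a simple calculation. The sketch below is our own illustration and not an analysis reported in this study: it uses only the counts quoted above and assumes that the 32 arginine-methylated proteins are a subset of the 6,709 proteins, so that a one-sided hypergeometric test is appropriate.

```python
from fractions import Fraction
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n items without replacement from N items,
    of which K are 'successes'. Computed exactly with integer arithmetic."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return float(Fraction(tail, total))

# Counts quoted in the text.
methylated_total, methylated_phospho = 32, 30
proteome_total, proteome_phospho = 6709, 2548

fraction_methylated = methylated_phospho / methylated_total   # proportion in methylated set
fraction_proteome = proteome_phospho / proteome_total         # proportion in whole proteome

# Probability of seeing 30 or more phosphoproteins among 32 randomly chosen proteins.
p_value = hypergeom_upper_tail(methylated_phospho, methylated_total,
                               proteome_phospho, proteome_total)
```

Under these assumptions the chance of observing such an excess in a random set of 32 proteins is very small. This does not account for possible biases, such as abundant proteins being both easier to detect as methylated and better annotated for phosphorylation.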

Arginine and lysine methylation motifs

To determine if methylation sites are enriched in specific sequence-motifs, all yeast methylation sites from FindMod analysis and the Swiss-Prot database were analysed to find enriched sequence-motifs. Methionine was found at position -1 from lysine methylation in 5 FindMod sites and in two additional methylation sites previously documented in S. cerevisiae (Table 5). This enrichment was of very high statistical significance (p = 1.18 × 10⁻⁶) as compared to that expected in any random sequence of yeast proteins. By contrast, residues found to be significantly enriched adjacent to arginine methylation included W at position -4 (p = 1.50 × 10⁻⁷), and G at position -3 (p = 6.08 × 10⁻⁶). While it was previously known that arginine methylation is found in RGG motifs, Wooderchak et al. (2008) [62] showed that arginine methylation is also found in RXG and RGX motifs. No known S. cerevisiae methylation sites documented in Swiss-Prot contained the RGG, RGX, or RXG motifs. However, FindMod found 7 methylation sites with the RXG or RGX motifs. Two methylation sites, Tdh3p dimethyl-R11 and Rpl4Ap dimethyl-R84, matched the GXXRXG motif, which conforms with the known RXG motif and the additional GXXR motif found in this study. Three methylation sites had the novel WXXXR motif.
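A positional enrichment of this kind can be sketched as below. The statistical test used in this study is not specified here, so a one-sided binomial test against a background residue frequency is used purely as a stand-in; the function names, the toy sequences and the background frequency of 0.02 are all hypothetical.

```python
from math import comb

def residue_at_offset(sequence, site, offset):
    """Residue at `offset` from a 1-based modification `site`, or None if
    that position falls outside the protein sequence."""
    index = site - 1 + offset
    return sequence[index] if 0 <= index < len(sequence) else None

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def offset_enrichment(sites, offset, residue, background_freq):
    """Count how often `residue` occurs at `offset` from each site and return
    (count, usable sites, binomial p-value against the background)."""
    observed = [residue_at_offset(seq, pos, offset) for seq, pos in sites]
    observed = [r for r in observed if r is not None]
    count = sum(r == residue for r in observed)
    return count, len(observed), binomial_upper_tail(count, len(observed), background_freq)

# Toy (sequence, methylated position) pairs; not real yeast proteins.
toy_sites = [("AAMKGG", 4), ("GGMKAA", 4), ("LLAKGG", 4), ("KAAA", 1)]
count, usable, p_value = offset_enrichment(toy_sites, -1, "M", background_freq=0.02)
```

In practice the background frequency would be estimated from the yeast proteome, and the same function could be applied at each offset around lysine and arginine sites (for example -4 for W and -3 for G).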

Table 5 Lysine and arginine methylation motifs.

Discussion

Large-scale discovery of lysine and arginine methylation sites

In this study, 45 lysine methylation sites and 38 arginine methylation sites were identified in 66 proteins in the S. cerevisiae proteome. These include 4 proteins previously known to be methylated in yeast or in other organisms and 15 proteins that are functionally related to others known to be methylated. Our findings support earlier studies [31–33] that suggested methylation to be quite widespread. Whilst many of our methylation sites are novel and have not been confirmed by MS-MS, the filters and replicate analyses we used in association with the FindMod tool provided a robust means by which protein methylation could be detected. The false positive rate was estimated to be 11% at 0.04 Da mass error. Notwithstanding this, it should be noted that whilst we did study 2,607 proteins from yeast, this is only ~40% of the total yeast proteome. Therefore, we expect that up to 60% of methylated proteins would have been missed. Further methylation sites may have been missed due to difficulties in mass spectrometric detection; an example is methylarginine, which is often found in arginine- and glycine-rich regions that produce tryptic peptides that are too small for routine MALDI-TOF analysis.

Discovery rates may reflect the sub-stoichiometric nature of methylation

Previous research has highlighted that methylated peptides are difficult to discover [32] and this is made more difficult because methylation is sub-stoichiometric [34]. For example, sub-stoichiometric levels of methylations were observed in the human heterogeneous nuclear ribonucleoprotein K (hnRNP K), in which < 33% of hnRNP K were asymmetrically dimethylated at R303, and < 10% were monomethylated at R287 [56]. Our results from FindMod analysis support these observations since the proportion of methylated peptides seen for any protein was very low. The sub-stoichiometric nature of methylation events was also supported by a weak but significant correlation between the discovery rates of modified and unmodified paired peptides. However, there may be explanations, other than biological, for the lower discovery rate of modified peptides. These included inefficient trypsin cleavage which occurs C-terminal to methylated lysine and arginine residues [32] and differences in MALDI-ToF ionisation of the methylated peptides as seen with different proteotypic peptides [63].

Methylated proteins are involved in specific biological functions and processes, are higher in abundance and have longer half-life

Methylated proteins were found to be enriched for specific biological processes, molecular functions and sub-cellular localizations. Firstly, methylated proteins were enriched in translation, ribosome biogenesis and assembly. This is consistent with previous studies in which methylated proteins have been linked to translation in Escherichia coli, S. cerevisiae, and Schizosaccharomyces pombe [11]. Ribosomal proteins are also known to show lysine or arginine methylation, for example the ribosomal proteins L10a, L12, and L26a of Arabidopsis [42]. Secondly, the methylated proteins described here were found to be involved in RNA metabolic processes and in RNA binding. This is consistent with the function of several proteins known to be methylated at RG-rich motifs [49]. The methylation of arginine in RG-rich motifs is conserved in human, and their RNA binding activity is also conserved [32]. One such example is the fragile X mental retardation protein (FMRP) [25]. Thirdly, our methylated proteins were enriched in the ribosome and the cytoplasm. This is consistent with the sites of translation and association with RNA inside the cell [22, 23]. Whilst the lack of enrichment of methylated proteins in the nucleus and nucleolus was not expected, this may have arisen due to our reduced set of proteins for analysis (40% of the yeast proteome). In addition, nuclear proteins such as histones and Npl3p are known to have peptides with multiple modification sites, but these were not searched for in this study. Methylated proteins found in this study were significantly higher in abundance than proteins currently known to be non-methylated. This is partly explained by ribosomal proteins and proteins involved in translation, some of which we found to be methylated, being of very high abundance [50, 64]. Methylated proteins were also found to be of longer average half-life. This may be due to their role in translation [11], where ribosomal proteins are generally stable [53].

Interplay of methylation and other post-translational modifications

The methylation of lysine is known to block the action of ubiquitin ligase [65], preventing proteins from degradation via the ubiquitin/proteasome system [52, 66]. Our observation of a distinct group of low half-life proteins in S. cerevisiae, none of which were methylated, suggests that lysine methylation might be on many proteins and prevent their ubiquitination. The limited number of ubiquitination sites currently known on yeast proteins [67, 68] makes it currently difficult to check if lysine methylation, as found in this study, is found on residues that can also be poly-ubiquitinated. However, our prediction of putative ubiquitination sites [54] showed that 43% of the lysine methylation sites in 40 proteins may be ubiquitinated.

Several studies have suggested that there is interplay between arginine methylation and phosphorylation of some proteins [55–61]. Arginine methylation may antagonise phosphorylation [56, 57], act as a switch to enable the binding of phosphatase to encourage dephosphorylation [60], or encourage phosphorylation [59]. On the other hand, phosphorylation can either interfere with arginine methylation [58, 61], or promote the recruitment of arginine methyltransferase [55]. We found that the majority of arginine-methylated proteins in our study (30 out of 32 or 94%) are known from the literature to be phosphorylated, suggesting an interplay between arginine methylation and phosphorylation in these proteins. However, these arginine methylation and phosphorylation sites were not necessarily directly adjacent in the protein sequence.

Arginine and lysine methylation motifs

Motif analysis showed that many methylation sites described here conform with previously known motifs. For example, 7 arginine methylation sites discovered by FindMod conformed with the known RXG and RGX motifs [62]. Arginine methylation sites were also enriched in GXXR motifs, which correlated with the enrichment of glycine residues near arginine methylation sites [69]. In addition, two experimentally verified methylated sites in Pfk2p and Rpl23Ap annotated in Swiss-Prot, along with 5 FindMod sites, suggest the existence of an MK lysine methylation motif. The discovery of the novel enriched methylation motif WXXXR supports the possibility that there are more methylation sites to be found in S. cerevisiae. These findings also raise an interesting question concerning which motifs are methylated by specific methyltransferases. The methyltransferases responsible for most methylation sites are also unknown (e.g. Tef1p K30, Pfk2p K180), and the function of several methyltransferase proteins in S. cerevisiae remains poorly characterized [13]. Therefore, more experiments are required to elucidate the function of methylation in S. cerevisiae.

Conclusions

This study is a step towards the definition of the methyl proteome of S. cerevisiae. It will be useful to guide future experiments on its predominance and role in the cell. For example, experiments are needed to elucidate the function of methylation and how each site is regulated, which with the exception of histone methylation is largely unknown. Secondly, experiments to investigate whether methylation sites overlap with poly-ubiquitination sites, and therefore prevent protein degradation via the ubiquitin/proteasome pathway, could be undertaken. Thirdly, it will be important to understand whether the functions of methylated proteins are co-regulated by ubiquitination, phosphorylation or other post-translational modifications. Finally, the ultimate goal in studying methylation should be to build networks of methylated proteins, their interaction partners and modifying enzymes to elucidate their dynamics as a system, similar to previous work on protein phosphorylation [70–72].

Methods

MALDI-ToF mass spectra for S. cerevisiae

This study employed MALDI-ToF peptide mass fingerprinting spectra from the large-scale characterization of protein complexes in S. cerevisiae [38]. There were 36,854 peptide mass spectra containing 1.2 million empirical masses, with an average mass error of 0.02 Da. These were from 2,607 proteins out of ~6,500 proteins (40%) in the yeast proteome, whereby each protein had an average of 11 spectra or at least 3 spectra. Peptide masses corresponding to unmodified peptides or tryptic peptides of porcine trypsin were removed, as were peptides less than 500 Da.

Tailor-made mass tolerance for each empirical spectrum

An error threshold was calculated for each of the 36,854 spectra; this was possible as the identity of all proteins was known. For each spectrum, the mass differences between the empirical and theoretical mass of all known unmodified peptides were calculated. The average and median mass tolerance was 0.04 Da. To ensure high accuracy of methylation discovery, only spectra with a mass error (Additional file 6) that was lower than 0.1 Da were used for the identification of methylation sites.
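The idea of a per-spectrum tolerance can be sketched as follows. The exact rule for turning the mass differences into a threshold is given in Additional file 6 and is not reproduced here; as a loudly labelled assumption, this sketch takes the largest absolute deviation among unmodified peptides that are matched within a coarse window. The function name, the window and the example masses are hypothetical.

```python
def spectrum_tolerance(observed_masses, theoretical_unmodified, window=0.5):
    """Estimate a mass tolerance (Da) for one spectrum.

    For each theoretical unmodified peptide mass, find the nearest observed
    mass; if it lies within `window` Da it is treated as a match and its
    absolute error is recorded. The tolerance is the largest recorded error
    (an assumption of this sketch, not necessarily the study's rule).
    Returns None if no unmodified peptide is matched.
    """
    errors = []
    for theoretical in theoretical_unmodified:
        nearest = min(observed_masses, key=lambda m: abs(m - theoretical))
        error = abs(nearest - theoretical)
        if error <= window:
            errors.append(error)
    return max(errors) if errors else None

# Hypothetical spectrum and theoretical unmodified peptide masses.
observed = [1000.02, 1500.03, 2000.50]
theoretical = [1000.00, 1500.00, 1800.00]
tolerance = spectrum_tolerance(observed, theoretical)

# Only spectra with a tolerance below 0.1 Da were used to find methylation sites.
usable_for_site_discovery = tolerance is not None and tolerance < 0.1
```

Because the protein identity is already known for each spectrum, the unmodified peptides act as internal calibrants, so a well-calibrated spectrum receives a tight tolerance and a poorly calibrated one is excluded.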

FindMod analysis of yeast proteins

Each peptide mass spectrum was analysed with FindMod [35]. A bulk submission web interface to FindMod was developed (http://ca.expasy.org/tools/findmod/findmod_batch.html). Each FindMod query used the UniProt accession number for the protein identified through peptide mass fingerprinting (from Gavin et al., 2006) [38], the experimental peptide masses for this protein and the tailor-made mass tolerance in Da. Other FindMod parameters included the use of monoisotopic masses, a maximum of 1 missed cleavage by trypsin, no amino acid substitutions, peptide ions treated as [M+H]+, and the possibility of oxidised methionine or tryptophan. The peptide masses were matched to theoretical peptides generated from the precursor sequence. The program searched for 71 types of post-translational modifications in all experimental peptide masses [35], including mono-, di-, and tri-methylation (http://www.expasy.ch/tools/findmod/findmod_masses.html). Matches to 6 types of modifications were removed from the analyses, as they are not found in S. cerevisiae or may lead to many false positives due to their low mass; for more details see Additional file 6. The Swiss-Prot database version 51.6 and TrEMBL version 34.6 [73] were used for the FindMod matches.

Filters to remove low quality methylation sites

For the methylated peptides to be included in the analysis, they needed to pass the following 5 filters. A peptide 1) could not be an unmodified peptide, 2) had to contain no Asp or Glu residues, and 3) had to have at most one missed tryptic cleavage. In addition, 4) the peptide had to have two or more overlapping peptides, at least one of which had to be an unambiguous peptide match. 5) When two or more modified peptides that passed filters 1-4 were also found to overlap and share the same modification site, the modification was classified as high confidence and kept. The use of overlapping peptides to improve the reliability of methylation site identification is facilitated by methylation sites found at the C-terminus of peptides. Trypsin cleavage at methylated arginine and lysine has been observed in many LC-MS/MS experiments [32, 74-77], although it is less efficient than at non-methylated residues. A list of tryptic peptides with C-terminal methylated amino acids, identified by LC-MS/MS, is shown in Additional file 7.
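The logic of the filters can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used in the study. The record type `ModifiedMatch` and its field names are hypothetical. Filter 4 is represented by a precomputed flag, because evaluating it requires the full set of matched peptides for the protein. Filter 5 is approximated as "at least two distinct passing peptides report the same site and modification".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModifiedMatch:
    """Hypothetical record for one peptide match reported for a protein."""
    sequence: str
    start: int                 # 1-based protein position of the first residue
    site: int                  # 1-based protein position of the methylated residue
    modification: str          # e.g. "monomethyl-K"
    missed_cleavages: int
    has_overlap_support: bool  # filter 4, assumed to be computed elsewhere
    is_modified: bool = True   # filter 1


def passes_peptide_filters(m):
    """Filters 1-4 applied to a single peptide match."""
    return (
        m.is_modified                            # 1) not an unmodified peptide
        and not set(m.sequence) & {"D", "E"}     # 2) no Asp or Glu residues
        and m.missed_cleavages <= 1              # 3) at most one missed cleavage
        and m.has_overlap_support                # 4) overlap support (precomputed)
    )


def high_confidence_sites(matches):
    """Filter 5: sites reported by two or more distinct passing peptides."""
    by_site = defaultdict(set)
    for m in matches:
        if passes_peptide_filters(m):
            by_site[(m.site, m.modification)].add((m.start, m.sequence))
    return {site for site, peptides in by_site.items() if len(peptides) >= 2}
```

In this sketch a site covered by both a fully cleaved peptide ending at the methylated residue and a longer peptide with one missed cleavage at that residue is kept, which mirrors the role of C-terminal methylation sites described above.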

Calculation of discovery rate

The discovery rate for an unmodified peptide was calculated as the fraction of protein identifications in which the unmodified peptide was observed. In the case of duplicated genes, the counts of protein identifications were summed because peptide mass fingerprinting cannot distinguish between proteins that do not differ in primary sequence. The discovery rate for a particular unmodified residue in the protein was calculated as the sum of the discovery rates of all the unmodified peptides that contain the residue. Discovery rates were also calculated for methylated peptides and methylated residues using the method described above. Partially methylated peptides are likely to have a low discovery rate. While only mass spectra with a maximum mass tolerance of 0.1 Da were used for finding the methylation sites, to limit the false positive rate, all available mass spectra with a mass tolerance of up to 1.5 Da were used for the calculation of discovery rate. This was because more mass spectra were needed to increase the sample size for the discovery rate calculation.
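The two definitions above can be written as a short Python sketch. The data layout (each protein identification given as a set of observed peptides, each peptide as a `(start, end)` pair of 1-based protein positions) and the function names are assumptions made for illustration.

```python
def peptide_discovery_rates(identifications):
    """Fraction of protein identifications in which each peptide is observed.

    `identifications` is a list with one entry per identification of the
    protein; each entry is a collection of (start, end) peptide coordinates.
    """
    n = len(identifications)
    if n == 0:
        return {}
    counts = {}
    for observed in identifications:
        for peptide in set(observed):  # count a peptide once per identification
            counts[peptide] = counts.get(peptide, 0) + 1
    return {peptide: c / n for peptide, c in counts.items()}


def residue_discovery_rate(rates, position):
    """Sum of the discovery rates of all peptides that contain the residue."""
    return sum(rate for (start, end), rate in rates.items() if start <= position <= end)
```

Because the residue-level value is a sum over peptides, a residue covered by several overlapping peptides (for example a fully cleaved peptide and a missed-cleavage variant) can reach a value above that of any single peptide.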

Evaluation of the true positive and false positive rate

Swiss-Prot entries with known lysine and arginine methylation sites were obtained from Swiss-Shop (http://au.expasy.org/swiss-shop/), for Swiss-Prot release 57.2 [73], by searching the MOD_RES field using the keywords 'methyllysine' and 'methylarginine'. S. cerevisiae protein sequences were downloaded from Swiss-Prot using the query 'organism:4932'. The annotations of known methylation sites were obtained from the MOD_RES field of the Swiss-Prot entry, and the type of methylation was determined from the standard RESID nomenclature [78]. The proteins were processed into mature forms where appropriate; these contain no signal peptides, propeptides or intein regions, and consist only of protein chains annotated in the 'CHAIN' field of the Swiss-Prot entry. For each M or W residue in a peptide, the mass of methionine or tryptophan oxidation, respectively, was added to the total mass of the peptide. Only methylated peptides with a maximum of one missed cleavage and with masses between 500 and 3,000 Da were used. Since lysine trimethylation is near-isobaric to lysine acetylation, trimethylation was not included in the analysis.

Two in silico test sets, the known methylation set and the artificial methylation set, were used to evaluate the true positive rate of FindMod for the discovery of mono- and di-methylation on arginine and lysine residues. The known methylation test set contained known lysine and arginine methylation sites from Swiss-Prot. The set of sequences from which methylation sites were taken was non-redundant at the 90% identity level, generated using UniRef90 [79]. This test set included 883 known mono- and di-methylation sites. The artificial mono- and di-methylation sites on lysine and arginine residues were generated by simulated methylation of theoretical unmodified peptides. The artificial test set had more data than the known methylation test set, to allow more accurate estimation of the true positive rate. Approximately 6% of lysine residues from S. cerevisiae protein sequences were randomly sampled to generate artificially methylated peptides for monomethyl-K. The sampling procedure was repeated for dimethyl-K, monomethyl-R, and dimethyl-R. This second test set was referred to as the artificial methylation set, and contained 36,594 artificial mono- and di-methylation sites.
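The exclusion of trimethylation follows from the monoisotopic mass shifts involved, which the following Python sketch works through. The elemental masses are standard monoisotopic values; the dictionary and function names are illustrative, and the matching rule (absolute difference within the tolerance) is a simplification of what a search tool does.

```python
# Standard monoisotopic masses (Da) of the elements involved.
MONO = {"C": 12.0, "H": 1.00782503207, "O": 15.99491461956}

# Net mass added per methyl group: one H is replaced by CH3, i.e. +CH2.
METHYL = MONO["C"] + 2 * MONO["H"]

SHIFTS = {
    "monomethyl": 1 * METHYL,
    "dimethyl": 2 * METHYL,
    "trimethyl": 3 * METHYL,
    # Net mass added by acetylation: +C2H2O.
    "acetyl": 2 * MONO["C"] + 2 * MONO["H"] + MONO["O"],
}


def matching_modifications(observed_shift, tolerance):
    """Names of all modifications whose mass shift lies within the tolerance."""
    return sorted(
        name for name, shift in SHIFTS.items()
        if abs(observed_shift - shift) <= tolerance
    )
```

Trimethylation (+42.0470 Da) and acetylation (+42.0106 Da) differ by about 0.036 Da, so at a tolerance of 0.04 Da `matching_modifications(SHIFTS["trimethyl"], 0.04)` returns both names, whereas mono- and di-methylation remain unambiguous.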

The true positive rate of FindMod, with the 5 filters described above, was evaluated using known methylation sites and artificial methylation sites. Removal of peptides containing D or E residues was not required since no artifactual methylation on D or E residues was introduced into the in silico test sets. The true positive rate was evaluated at a mass tolerance of 0.04 Da, since this was the median mass tolerance for all empirical peptide masses [38]. For each test set, a true positive FindMod match required the residue, sequence position and type of methylation to be correctly matched. The true positive rate of FindMod was calculated as the number of true positive matches divided by the sum of the number of false positive matches and the number of true positive matches, expressed as a percentage.
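The rate defined above can be computed as in the following Python sketch. Representing each site as a `(protein, residue, position, methylation type)` tuple is an assumption made for illustration; exact tuple equality then enforces the requirement that residue, position and type all match.

```python
def true_positive_rate(predicted, expected):
    """Percentage of predicted sites that are correct: 100 * TP / (TP + FP).

    Each site is a (protein, residue, position, methylation_type) tuple.
    Returns None when nothing was predicted.
    """
    predicted, expected = set(predicted), set(expected)
    if not predicted:
        return None
    true_positives = len(predicted & expected)
    false_positives = len(predicted - expected)
    return 100.0 * true_positives / (true_positives + false_positives)
```

For example, if three sites are predicted and two of them match the test set in residue, position and type, the rate is 66.7%. Under this definition, sites in the test set that are not predicted at all do not enter the calculation.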

Arginine and lysine methylation motif analysis

Ten amino acid residues N-terminal and C-terminal to each methylation site were included in the motif analysis. The number of times each amino acid occurs at each of these positions was counted. For any methylation site less than 10 residues from the N- or C-terminus of the protein, positions beyond the limit of the sequence were disregarded. To measure whether an amino acid was significantly enriched at each position, a p-value was calculated using the prop.test function in the R statistical package. A one-sided statistical test was used, with an alternative hypothesis that there was an enrichment of amino acid frequency over the average frequency. Bonferroni's correction was used to correct the p-value calculated by prop.test to reduce false positives.
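The counting and testing steps can be sketched in Python using only the standard library. The study used prop.test in R; the function below is a stand-in that computes a one-sample, one-sided proportion test with continuity correction in the same spirit, and should not be read as the authors' script. The function names, the input layout and the number of tests passed to the Bonferroni correction are assumptions.

```python
import math
from collections import Counter


def position_counts(sequences_and_sites, flank=10):
    """Count amino acids at each offset (-flank..+flank, excluding 0) around sites.

    `sequences_and_sites` holds (protein_sequence, site) pairs with 1-based sites.
    Offsets that fall outside the sequence are disregarded.
    """
    counts = {off: Counter() for off in range(-flank, flank + 1) if off != 0}
    for seq, site in sequences_and_sites:
        for off in counts:
            idx = site - 1 + off
            if 0 <= idx < len(seq):
                counts[off][seq[idx]] += 1
    return counts


def enrichment_p_value(x, n, p0):
    """One-sided p-value for an observed proportion x/n exceeding background p0.

    Uses a continuity-corrected chi-square statistic converted to a signed
    z-score, then the upper tail of the standard normal distribution.
    """
    expected = n * p0
    yates = min(0.5, abs(x - expected))
    chi2 = (abs(x - expected) - yates) ** 2 / (n * p0 * (1 - p0))
    z = math.copysign(math.sqrt(chi2), x / n - p0)
    return 0.5 * math.erfc(z / math.sqrt(2))


def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * n_tests)
```

With a background frequency of 0.1, observing an amino acid 30 times in 100 sites at one position gives a very small p-value, whereas observing it 10 times gives 0.5, that is, no evidence of enrichment.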

Functional analysis and statistical tests

Functional data co-analysed with modifications were protein abundance [50], protein half-life [53], Gene Ontology (GO) slim (from Saccharomyces Genome Database, ftp://ftp.yeastgenome.org/yeast/) [80] and protein complexes [81]. Nonparametric tests were used for all statistical analyses. Protein abundance data, in copies per cell, were from Ghaemmaghami et al. (2003) [50]. Protein half-life data, in minutes, were from Belle et al. (2006) [53]. To investigate if lysine methylation might block ubiquitination, the UbiPred software [54] was used to predict if known methylated lysine sites are also subject to ubiquitination. To investigate if arginine-methylated proteins were co-regulated by phosphorylation, the Swiss-Prot database release 57.2 [73] was examined to see if methylated proteins also had experimentally determined threonine, serine, and tyrosine phosphorylation sites. Mann-Whitney tests, a non-parametric substitute for Student's t-test, were used to compare two samples. Kendall's correlation coefficient, a non-parametric substitute for Pearson's correlation coefficient, was used to measure the significance of the correlation between two samples. GO slim term enrichment was assessed using Fisher's exact test and Bonferroni correction [82]. All statistical analyses were performed using the R statistical package version 2.2.1 [83].