Background

The nucleolus is a sub-nuclear cellular compartment that is accessible to a large number of proteins since it is not surrounded by a membrane. To date, over 4500 distinct human proteins have been identified from purified nucleoli [1]. The most well-characterized function of the nucleolus is the biogenesis of ribosomes [2]. However, nucleolar proteins are diverse and dynamic, reflecting the central role of this compartment in the cell through its involvement in numerous other key cellular processes and in the cellular response to changing conditions [37]. Indeed, many proteins have been found to localize cyclically or conditionally to the nucleolus [3, 4, 7, 8].

Although such a large and dynamic volume of cellular traffic likely requires extensive regulation, proteins are often proposed to localize to the nucleolus simply through high-affinity binding to core nucleolar components [6, 9]. Despite this, numerous disparate reports of short nucleolar targeting sequences in proteins have been published over the past 20 years. Many of these sequences can localize non-nucleolar reporter proteins to the nucleolus when fused to them. In an effort to catalogue and systematically characterize these Nucleolar Localization Sequences (NoLSs), we have recently curated the literature and assembled a human NoLS dataset which we subsequently used to train an artificial neural network computational predictor [10]. The predictor considers the protein sequence and JPred predictions of protein secondary structure [11]. When applied to the entire human proteome, it identified thousands of candidate NoLSs, ten of which were experimentally tested and confirmed to target the nucleolus [10].

Here, we describe NoD, a web server and a command-line program that provides computer predictions of NoLSs in proteins. We also investigate the application of the human-trained predictor in other eukaryotic and viral organisms, demonstrating that NoD can give effective NoLS predictions in a wide variety of species.

Implementation

The NoD web server provides an easy way to predict NoLSs within a protein sequence. NoD predictions are obtained by entering a protein sequence in fasta format on the NoD webserver http://www.compbio.dundee.ac.uk/nod. Protein sequences are encoded as previously described [10]. Briefly, sliding windows of size 13 are sparsely encoded in a binary format using a reduced alphabet of size 12 for submission to an artificial neural network (ANN). The current implementation of NoD uses a local version of Batchman from the Stuttgart Neural Network Simulator [12] and the human-trained NoLS prediction model developed previously [10] to provide the prediction for each encoded subsequence. The Batchman output is then processed and NoLSs are predicted if the average score output by the ANN of 8 consecutive windows is at least 0.8 [10]. Finally, the prediction is displayed as shown in Figure 1 if at least one NoLS is identified. Otherwise, the user is informed that no NoLS is predicted in the input protein. As shown in Figure 1, for proteins predicted to contain NoLS(s), the output consists of 3 sections:

Figure 1
figure 1

Example of NoLS prediction returned by NoD. If at least one NoLS is predicted in a protein, NoD returns an output page that displays the sequence and position of the predicted NoLSs, the full-length protein sequence as entered by the user with the NoLSs in red and a graph showing the average NoLS prediction score for every 20-residue window in the protein. The region shown in pink in this graph is the NoLS candidate segment region and represents the range of scores within which a 20-residue segment is predicted to be a NoLS.

  • the sequences of the predicted NoLS(s) are first enumerated

  • the full-length protein sequence is displayed with the predicted NoLS(s) shown in red

  • finally, a graph is presented of the NoLS window-based score [10] as a function of position in the protein sequence.

The NoLS window-based score graph can be useful to guide experimental design of nucleolar targeting. The graph gives an overview of the entire protein and shows the proportion of the protein with putative nucleolar targeting capabilities as well as regions of the protein that are near the cut-off threshold and therefore almost predicted as NoLSs.

When entering a protein sequence, the user is provided with the option of also running JPred secondary structure predictions [11] to include as input to the NoLS neural network. If JPred is selected, the accuracy of prediction is slightly higher [10] but the computation time is increased.

For users who want predictions for whole proteomes there is a command line version of NoD called clinod. Clinod produces the same results as a web server but it is more suitable for processing of multiple sequences and is convenient to use within software pipelines.

Clinod requires Java 6 and the Batchman executable from the Stuttgart Neural Network Simulator [12] to run. Clinod accepts the list of FASTA formatted sequences from an input file and outputs the predictions to a file or the console. By default the following output is produced for each sequence-the name of the sequence, the number of NoLSs predicted, the start and the end positions and the sequences of each predicted NoLS. However, for better integration with other bioinformatics tools, many more output options are supported. For example, the input sequences can be cleaned (stripped of ambiguous characters), and output along with the prediction results and sequences with no predicted NoLS can be omitted from the output. Various output options are described in Table 1 but for a detailed description of the clinod switches please refer to Additional file 1.

Table 1 Clinod output formats

Finally, for users preferring to run and visualize their predictions locally, there is a virtual appliance version of NoD, which can easily be deployed on a variety of operating systems by a non-specialist user. The virtual appliance version of NoD offers the same functionality as our public server, with the exception of JPred predictions. However, in the near future we intend to release a version which will support JPred.

Results and Discussion

Prediction of NoLSs in non-human eukaryotes

Because more NoLSs have been reported in human than in all other organisms combined, the NoLS predictor was originally trained and tested only on human sequences [10]. More precisely, as described previously [10], the predictor was trained on a manually curated positive set of 46 human experimentally validated NoLSs and a negative set consisting of several hundred human proteins chosen because they are believed not to localize to the nucleolus. After training, ten of the NoLS predictions were chosen for experimental validation and all were confirmed as positives [10].

However, the prediction of NoLSs is relevant in all eukaryotes and in particular in their viruses, many of which encode proteins that localize to the nucleolus of their host cells [13]. To investigate whether the human-trained predictor can be applied to other organisms, we searched the literature to find examples of NoLSs that have been experimentally identified in other organisms. In total, we collated 31 eukaryotic or viral NoLSs (including 6 human NoLSs that had not been used for training or testing previously) which are listed in Table 2, along with the position of the experimentally determined NoLSs. Sequences were filtered to remove redundancy within this dataset and redundancy with the original training set as described previously [10]. The full-length sequences of these NoLS-containing proteins were then passed through the NoLS predictor. As with the original NoLS paper [10], only experimentally validated NoLSs of length less than 50 residues were considered for testing. This focuses the testing on those NoLSs that have been most confidently identified by experiment and reduces the likelihood that we are dealing with signal patches (ie signals formed from residues distant in the primary sequence but that come into close proximity in the folded molecule). We considered NoLSs as correctly-predicted if the region of overlap between the predicted NoLS and the experimentally determined NoLS covered at least 60% of the shortest of the two molecules. In many cases, the predicted NoLS region is entirely contained within the experimentally determined NoLS or vice versa. Details of the predictions, including the position of predicted NoLSs, are given in Table 2 and a summary of the prediction accuracy is given in Table 3.

Table 2 Detail of NoD predictions on the multi-organism testing dataset assembled
Table 3 Accuracy of NoD predictions in all organisms investigated

As shown in Table 3, mammalian NoLSs and their viral counterparts are well predicted, with sensitivity and positive predictive values ranging from 0.75 to 1.0. This is not surprising because of the close evolutionary distance between humans and other mammals and the constant adaptation of their viruses. Amongst the non-mammalian proteins considered, the Dictyostelium discoideum protein investigated has two NoLSs, one of which is well-predicted. The NoLS that was not correctly identified consists of only 7 amino acids and is likely too short for the predictor to find. The two mollusc NoLSs are entirely well-predicted but low numbers of examples in this group of organisms prevents robust conclusions. Similarly, plant and plant-infecting virus NoLSs are generally well-predicted but also suffer from small numbers of examples. However, the human-trained predictor is entirely incapable of identifying the NoLSs experimentally detected in trypanosomes. This agrees well with experiments in which the NoLS of a Trypanosome brucei protein, ESAG8, was fused to a fluorescent reporter protein and tested for nucleolar localization in human cells. With an intact trypanosome NoLS, the fusion protein was found to be nuclear but not nucleolar in human cells [14]. This observation and our predictions suggest that nucleolar targeting mechanisms differ significantly between humans and trypanosomes and are not interchangeable. Although a larger sample would be needed to confirm this difference, trypanosomal nucleolar targeting mechanisms might represent good drug targets.

While no experimentally generated negative dataset is available for testing the predictor in non-human organisms, we note that the non-NoLS regions of NoLS-containing proteins provide such a set. For example, the human adenovirus 2 protein VII encodes three basic regions at positions 47-54, 93-112 and 127-141 which represent possible nuclear/nucleolar localization sequences [15]. Deletion constructs demonstrate that only the 93-112 segment targets a reporter protein to the nucleolus [15]. This segment matches very closely the NoD NoLS predictions (see Table 2), providing not only an accurate test example but also perfect negative controls (the two other basic regions are not predicted as NoLSs). Thus, the positive predictive values shown in Table 3 give an indication of the false positive rate of prediction in different organisms. However, while some false positives probably represent prediction errors, others might have occurred because NoLSs were not experimentally mapped with enough precision or more than one NoLS exists in the protein. Larger experimental datasets will undoubtedly help to clarify this problem.

Of the 31 eukaryotic and viral NoLSs considered for independent testing, 22 were correctly identified, resulting in an overall sensitivity of 71%. In addition, 6 non-NoLS regions were also identified as positives (and thus are considered here as false positives) yielding an overall positive predictive value of 79%. Finally, of the 27 proteins considered, 6 were predicted to encode NoLSs in regions not experimentally shown to harbour a NoLS resulting in a specificity of 78% (although we note that some of these false positives might represent NoLSs that have yet to be experimentally identified).

Conclusions

NoD is currently the only predictor capable of providing and visualizing NoLS predictions for protein sequences.

The web server takes a protein sequence as input and returns the positions and the sequences of the predicted NoLSs. It also draws a graph of the predicted scores for each residue of the sequence.

The command line NoD takes the list of FASTA formatted protein sequences as an input, and for each sequence outputs the number of detected NoLSs, their positions in the full-length sequence and their sequences. However, the command line predictor output is highly customisable and can be adjusted to the user's needs. Finally, the virtual appliance version of the predictor allows the deployment and running of the predictor locally, eliminating data privacy issues.

Cross-species testing shows NoD to perform best for mammalian and mammalian-infecting viral proteins but preliminary results suggest sequences from molluscs, amoebae, plants and their viruses are also well-predicted.

Availability and requirements

  • Project name: NoD (N ucleo lar localization sequence D etector)

  • Project home page: http://www.compbio.dundee.ac.uk/nod

  • Operating system(s): Platform independent

  • Programming language: Java

  • Other requirements: The command line version requires Java 6 or higher, and the SNNS Batch Interpreter V1.0 [12]. The virtual appliance version requires freely available VMware Player 3.1 [16] or higher, commercial VMware Fusion (for Mac users) [17] or the freely available Oracle VirtualBox v3.2.12 [18]

  • License: Apache 2

  • Any restrictions to use by non-academics: no restrictions

Acknowledgements and Funding

We would like to thank Dr Tom Walsh for technical expertise. This work was supported by a post-doctoral fellowship from the Caledonian Research Foundation to MSS, by the Scottish Universities Life Sciences Alliance (SULSA), by the European Network of Excellence ENFIN [contract LSHG-CT-2005-518254], and by Wellcome Trust grant WT083481.