Evolving Regular Expression-Based Sequence Classifiers for Protein Nuclear Localisation

  • Amine Heddad
  • Markus Brameier
  • Robert M. MacCallum
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3005)


A number of bioinformatics tools use regular expression (RE) matching to locate protein or DNA sequence motifs that have been discovered by researchers in the laboratory. For example, patterns representing nuclear localisation signals (NLSs) are used to predict nuclear localisation. NLSs are not yet well understood, and so the set of currently known NLSs may be incomplete. Here we use genetic programming (GP) to generate RE-based classifiers for nuclear localisation. While the approach is a supervised one (with respect to protein location), it is unsupervised with respect to already-known NLSs. It therefore has the potential to discover new NLS motifs. We apply both tree-based and linear GP to the problem. The inclusion of predicted secondary structure in the input does not improve performance. Benchmarking shows that our majority classifiers are competitive with existing tools. The evolved REs are usually “NLS-like” and work is underway to analyse these for novelty.
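As a rough illustration of the RE-based classification described above, the following Python sketch treats each regular expression as one voter and predicts nuclear localisation when a majority of the REs match. It assumes that "majority classifier" means a simple majority vote over several REs. The three patterns, the names `PATTERNS`, `re_votes` and `majority_classify`, and the test sequences are hypothetical and chosen only for illustration; they are not the evolved REs or the GP system from the paper.

```python
import re

# Hypothetical example patterns (NOT the evolved REs from the paper),
# loosely modelled on clusters of basic residues (K = lysine, R = arginine).
PATTERNS = [
    r"K[KR].[KR]",              # short basic cluster
    r"[KR]{2}.{10,12}[KR]{3}",  # two basic clusters separated by a spacer
    r"[KR]{4}",                 # run of four basic residues
]


def re_votes(sequence, patterns=PATTERNS):
    """Return one boolean vote per pattern: True if the RE matches anywhere."""
    return [re.search(p, sequence) is not None for p in patterns]


def majority_classify(sequence, patterns=PATTERNS):
    """Predict 'nuclear' when more than half of the REs match the sequence."""
    votes = re_votes(sequence, patterns)
    return "nuclear" if sum(votes) * 2 > len(votes) else "non-nuclear"


if __name__ == "__main__":
    # Toy sequences made up for this sketch.
    for seq in ("MSPKKKRKVEDG", "MAGSLLTDEVAG"):
        print(seq, re_votes(seq), majority_classify(seq))
```

In a GP setting the patterns would be produced by evolution and scored on labelled proteins rather than written by hand; the voting step itself would look much the same.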


Regular Expression; Nucleic Acid Research; Linear Genetic Programming; Protein Nuclear Localisation; Nuclear Localisation Signal Motif
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Amine Heddad
    • 1
  • Markus Brameier
    • 1
  • Robert M. MacCallum
    • 1
  1. Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden
