Data Mining in Proteomics with Learning Classifier Systems

  • Jaume Bacardit
  • Michael Stout
  • Jonathan D. Hirst
  • Natalio Krasnogor
Part of the Studies in Computational Intelligence book series (SCI, volume 125)


The era of data mining has renewed research effort in areas of biology that, because of their difficulty and the limited knowledge available, were and still are considered unsolved problems. One such problem, and one of the fundamental open problems in computational biology, is the prediction of the 3D structure of proteins, or protein structure prediction (PSP). Human experts, with the crucial help of data mining tools, are learning how proteins fold to form their structure, but are still far from providing perfect models for all kinds of proteins. Data mining and knowledge discovery are essential to advancing the understanding of the folding process. In this context, Learning Classifier Systems (LCS) are very competitive tools. They have shown their competence in many different data mining tasks, and they provide human-readable solutions that can help experts understand the PSP problem. In this chapter we describe our recent efforts in applying LCS to PSP-related domains. Specifically, we focus on a relevant PSP subproblem called Coordination Number (CN) prediction. CN is a kind of simplified profile of the 3D structure of a protein. Two kinds of experiments are described: the first analyzes different ways to represent the basic composition of proteins, their primary sequence; the second assesses different data sources and problem definition methods for performing competent CN prediction. In all the experiments LCS show their competence in terms of both accurate predictions and explanatory power.
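
To make the notion of a coordination number concrete, the following minimal sketch counts, for each residue, how many other residues lie within a distance cutoff of it. This is an illustration under stated assumptions, not the definition used in the chapter: the function name `coordination_numbers`, the hard 10 Å cutoff, the exclusion of residues closer than three positions along the chain, and the toy coordinates are all hypothetical choices made for this example.

```python
import math


def coordination_numbers(coords, cutoff=10.0, min_separation=3):
    """Count, for each residue, the other residues within `cutoff` of it.

    coords         : list of (x, y, z) tuples, one representative atom per residue
    cutoff         : distance threshold (assumed value, not from the chapter)
    min_separation : pairs closer than this along the chain are ignored
                     (assumed value, not from the chapter)
    """
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Skip near neighbours in sequence; they are close by construction.
            if j - i < min_separation:
                continue
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


# Hypothetical six-residue chain bent into a hairpin, 3.8 units between neighbours.
toy_chain = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]

if __name__ == "__main__":
    print(coordination_numbers(toy_chain, cutoff=4.0))
```

The resulting vector, one integer per residue, is the kind of simplified structural profile that a classifier could be trained to predict from sequence-derived attributes.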


Protein Chain; Minimum Description Length; Protein Structure Prediction; Input Attribute; Amino Acid Type
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Jaume Bacardit (1)
  • Michael Stout (1)
  • Jonathan D. Hirst (2)
  • Natalio Krasnogor (1)
  1. Automated Scheduling, Optimization and Planning research group, School of Computer Science and IT, University of Nottingham, UK
  2. School of Chemistry, University of Nottingham, University Park, UK
