Measuring Performance when Positives Are Rare: Relative Advantage versus Predictive Accuracy — A Biological Case-Study

  • Stephen H. Muggleton
  • Christopher H. Bryant
  • Ashwin Srinivasan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1810)


This paper presents a new method of measuring performance when positives are rare and investigates whether Chomsky-like grammar representations are useful for learning accurate, comprehensible predictors of members of biological sequence families. The positive-only learning framework of the Inductive Logic Programming (ILP) system CProgol is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). Performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The RA results show that searching for NPPs by using our best NPP predictor as a filter is more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. Predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between NPP recognition models: although the models cover varying numbers of the rare positives, predictive accuracy awards them all a similar, high score because they all exclude most of the abundant negatives.
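The point about predictive accuracy can be made concrete with a small sketch. All counts below are hypothetical and are not taken from the paper. The `enrichment` function is only an illustrative stand-in (hit rate among filtered items divided by the base rate of positives); it is not the paper's definition of Relative Advantage, which the abstract does not give.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one recognition model (hypothetical data)."""
    tp: int  # positives the model accepts
    fp: int  # negatives the model accepts
    tn: int  # negatives the model rejects
    fn: int  # positives the model rejects

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.tn + self.fn


def accuracy(c: Confusion) -> float:
    """Fraction of all examples classified correctly."""
    return (c.tp + c.tn) / c.total


def recall(c: Confusion) -> float:
    """Fraction of the rare positives that the model covers."""
    return c.tp / (c.tp + c.fn)


def enrichment(c: Confusion) -> float:
    """Illustrative stand-in, NOT the paper's RA: hit rate among accepted
    items divided by the hit rate of random selection."""
    precision = c.tp / (c.tp + c.fp)
    prevalence = (c.tp + c.fn) / c.total
    return precision / prevalence


# Hypothetical setting: 100,000 sequences, of which only 50 are positives.
model_a = Confusion(tp=10, fp=20, tn=99_930, fn=40)
model_b = Confusion(tp=40, fp=20, tn=99_930, fn=10)

for name, m in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: accuracy={accuracy(m):.4f} "
          f"recall={recall(m):.2f} enrichment={enrichment(m):.1f}")
```

In this made-up setting model B covers four times as many positives as model A, yet the two accuracies differ only in the fourth decimal place, because both models reject nearly all of the abundant negatives. A ratio against random selection separates the two models clearly.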



Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Stephen H. Muggleton (1)
  • Christopher H. Bryant (1)
  • Ashwin Srinivasan (2)
  1. Computer Science, University of York, York, UK
  2. Computing Laboratory, Oxford University, Oxford, UK
