Abstract
This paper presents a new method of measuring performance when positives are rare and investigates whether Chomsky-like grammar representations are useful for learning accurate comprehensible predictors of members of biological sequence families. The positive-only learning framework of the Inductive Logic Programming (ILP) system CProgol is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). Performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The RA results show that searching for NPPs by using our best NPP predictor as a filter is more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. Predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between NPP recognition models: despite covering varying numbers of (the rare) positives, all the models are awarded a similar (high) score by predictive accuracy because they all exclude most of the abundant negatives.
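The point about predictive accuracy can be illustrated with a small sketch. The abstract does not define Relative Advantage, so the code below does not implement it; as a stand-in it uses the ratio of a filter's hit rate to the hit rate of random selection, which is an assumption made only for illustration. The function names and all counts are hypothetical and are not taken from the paper.

```python
def predictive_accuracy(tp, fp, tn, fn):
    """Fraction of all examples that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def hit_rate_ratio(tp, fp, tn, fn):
    """Hit rate among predicted positives divided by the base rate of positives.

    This is an illustrative stand-in, NOT the paper's Relative Advantage,
    whose definition is not given in the abstract.
    """
    total = tp + fp + tn + fn
    predicted_positive = tp + fp
    actual_positive = tp + fn
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0
    hit_rate_with_filter = tp / predicted_positive
    hit_rate_at_random = actual_positive / total
    return hit_rate_with_filter / hit_rate_at_random


# Hypothetical confusion counts: 100,000 sequences, of which 50 are positive.
# Model A covers 10 of the 50 positives, model B covers 40; both make the
# same number of false-positive predictions.
model_a = dict(tp=10, fp=20, tn=99_930, fn=40)
model_b = dict(tp=40, fp=20, tn=99_930, fn=10)

acc_a = predictive_accuracy(**model_a)
acc_b = predictive_accuracy(**model_b)
ratio_a = hit_rate_ratio(**model_a)
ratio_b = hit_rate_ratio(**model_b)

print(f"accuracy: A={acc_a:.4f}  B={acc_b:.4f}")
print(f"hit-rate ratio: A={ratio_a:.1f}  B={ratio_b:.1f}")
```

With counts like these, both models score almost identically on accuracy because the abundant negatives dominate the total, while the hit-rate ratio separates them clearly.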
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Muggleton, S.H., Bryant, C.H., Srinivasan, A. (2000). Measuring Performance when Positives Are Rare: Relative Advantage versus Predictive Accuracy — A Biological Case-Study. In: López de Mántaras, R., Plaza, E. (eds) Machine Learning: ECML 2000. ECML 2000. Lecture Notes in Computer Science, vol 1810. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45164-1_32
DOI: https://doi.org/10.1007/3-540-45164-1_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67602-7
Online ISBN: 978-3-540-45164-8