Measuring Performance when Positives Are Rare: Relative Advantage versus Predictive Accuracy — A Biological Case-Study
This paper presents a new method of measuring performance when positives are rare and investigates whether Chomsky-like grammar representations are useful for learning accurate, comprehensible predictors of members of biological sequence families. The positive-only learning framework of the Inductive Logic Programming (ILP) system CProgol is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). Performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The RA results show that searching for NPPs by using our best NPP predictor as a filter is more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. Predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between NPP recognition models: although the models cover varying numbers of the rare positives, predictive accuracy awards them all a similarly high score because they all exclude most of the abundant negatives.
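The point about predictive accuracy can be illustrated with a small arithmetic sketch. All counts below are invented for illustration and are not results from the paper. The `lift_over_random` function is a simple precision-to-base-rate ratio used here as a stand-in; it is an assumption of this sketch and not the paper's definition of Relative Advantage, which the abstract does not give.

```python
# Hypothetical illustration: why accuracy barely separates models when positives are rare.
# All counts are made up; lift_over_random is NOT the paper's Relative Advantage.

def accuracy(tp, fp, n_pos, n_neg):
    """Fraction of all examples classified correctly."""
    tn = n_neg - fp  # negatives correctly excluded
    return (tp + tn) / (n_pos + n_neg)

def lift_over_random(tp, fp, n_pos, n_neg):
    """Hit rate among selected examples divided by the hit rate of random selection."""
    selected = tp + fp
    if selected == 0:
        return 0.0
    precision = tp / selected
    base_rate = n_pos / (n_pos + n_neg)
    return precision / base_rate

# Assumed population: 10 positives among 10,000 examples.
N_POS, N_NEG = 10, 9990

# Two hypothetical models with the same false positives but different positive coverage.
model_a = {"tp": 2, "fp": 20}
model_b = {"tp": 8, "fp": 20}

acc_a = accuracy(model_a["tp"], model_a["fp"], N_POS, N_NEG)
acc_b = accuracy(model_b["tp"], model_b["fp"], N_POS, N_NEG)
lift_a = lift_over_random(model_a["tp"], model_a["fp"], N_POS, N_NEG)
lift_b = lift_over_random(model_b["tp"], model_b["fp"], N_POS, N_NEG)

print(f"accuracy: A={acc_a:.4f}  B={acc_b:.4f}")
print(f"lift over random: A={lift_a:.1f}  B={lift_b:.1f}")
```

In this toy setting the two accuracies differ by less than a tenth of a percentage point, while the ratio against random selection differs by roughly a factor of three, which is the kind of separation a measure designed for rare positives is meant to expose.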