Abstract
We propose novel methodologies that enable valid statistical hypothesis testing when only positive and unlabelled (PU) examples are available. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative cases is sufficient for independence testing, but not for power analysis; (2) a new methodology that compensates for this and enables power analysis, allowing sample size determination for observing an effect with a desired power; and (3) a new capability, supervision determination, which determines a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. Beyond general hypothesis testing, we suggest these tools will also be useful for information-theoretic feature selection and Bayesian network structure learning.
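To make the setting concrete, the following is a minimal sketch (not the authors' method) of the two ingredients the abstract refers to: a G-test (generalised likelihood ratio test) of independence applied under the naive assumption that all unlabelled examples are negative, and a standard Cohen-style sample size approximation for a 1-degree-of-freedom chi-square test. The contingency table and function names are hypothetical, and the paper's PU correction factors are deliberately omitted here.

```python
import math
from statistics import NormalDist

def g_test_independence(table):
    """G-test (generalised likelihood ratio test) of independence
    for a 2x2 contingency table given as [[a, b], [c, d]]."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = rows[i] * cols[j] / n
            if observed > 0:
                g += 2.0 * observed * math.log(observed / expected)
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value

def sample_size_for_power(w, alpha=0.05, power=0.8):
    """Approximate sample size for a 1-df chi-square/G-test to detect
    effect size w (Cohen's w) at the given significance level and
    power, via the usual normal approximation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / w) ** 2)

# Hypothetical PU data: rows = feature present / absent,
# columns = labelled positive vs unlabelled (treated as negative).
table = [[40, 160],
         [10, 290]]
g, p = g_test_independence(table)

# Samples needed to detect a "small" effect (Cohen's w = 0.2)
n_needed = sample_size_for_power(w=0.2)
```

The paper's point is that the independence test above remains valid under the naive negative assumption, whereas the sample size computed this way is miscalibrated in the PU setting, which is what the proposed correction addresses.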
Keywords
- Mutual Information
- False Negative Rate
- Sample Size Determination
- Generalised Likelihood Ratio Test
- Sanity Check
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
Cite this paper
Sechidis, K., Calvo, B., Brown, G. (2014). Statistical Hypothesis Testing in Positive Unlabelled Data. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science(), vol 8726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44845-8_5
DOI: https://doi.org/10.1007/978-3-662-44845-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44844-1
Online ISBN: 978-3-662-44845-8