Abstract
We propose novel methodologies that enable valid statistical hypothesis testing when only positive and unlabelled (PU) examples are available. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative cases is sufficient for independence testing, but not for power analysis; (2) a new methodology that compensates for this and enables power analysis, allowing sample size determination for observing an effect with a desired power; and (3) a new capability, supervision determination, which determines a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. Beyond general hypothesis testing, we suggest these tools will also be useful for information-theoretic feature selection and Bayesian network structure learning.
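To make the setting concrete, the following is a minimal sketch (not the authors' method) of the two ingredients the abstract refers to: a G-test (generalised likelihood ratio test) of independence applied under the naive assumption that all unlabelled examples are negative, and a standard Cohen-style sample size approximation for a 1-degree-of-freedom chi-square test. The contingency table and function names are hypothetical, and the paper's PU correction factors are deliberately omitted here.

```python
import math
from statistics import NormalDist

def g_test_independence(table):
    """G-test (generalised likelihood ratio test) of independence
    for a 2x2 contingency table given as [[a, b], [c, d]]."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = rows[i] * cols[j] / n
            if observed > 0:
                g += 2.0 * observed * math.log(observed / expected)
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value

def sample_size_for_power(w, alpha=0.05, power=0.8):
    """Approximate sample size for a 1-df chi-square/G-test to detect
    effect size w (Cohen's w) at the given significance level and
    power, via the usual normal approximation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / w) ** 2)

# Hypothetical PU data: rows = feature present / absent,
# columns = labelled positive vs unlabelled (treated as negative).
table = [[40, 160],
         [10, 290]]
g, p = g_test_independence(table)

# Samples needed to detect a "small" effect (Cohen's w = 0.2)
n_needed = sample_size_for_power(w=0.2)
```

The paper's point is that the independence test above remains valid under the naive negative assumption, whereas the sample size computed this way is miscalibrated in the PU setting, which is what the proposed correction addresses.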
Keywords
- Mutual Information
- False Negative Rate
- Sample Size Determination
- Generalised Likelihood Ratio Test
- Sanity Check
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
Cite this paper
Sechidis, K., Calvo, B., Brown, G. (2014). Statistical Hypothesis Testing in Positive Unlabelled Data. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science(), vol 8726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44845-8_5
DOI: https://doi.org/10.1007/978-3-662-44845-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44844-1
Online ISBN: 978-3-662-44845-8