Statistical Hypothesis Testing in Positive Unlabelled Data

  • Konstantinos Sechidis
  • Borja Calvo
  • Gavin Brown
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8726)

Abstract

We propose a set of novel methodologies that enable valid statistical hypothesis testing when only positive and unlabelled (PU) examples are available. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative cases is sufficient for independence testing, but not for power analysis activities; (2) a new methodology that compensates for this and enables power analysis, allowing sample size determination for observing an effect with a desired power; and finally, (3) a new capability, supervision determination, which can determine a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. Beyond general hypothesis testing, we suggest the tools will additionally be useful for information theoretic feature selection and Bayesian Network structure learning.
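As an illustration of contribution (1), the following is a minimal sketch of an independence test in which every unlabelled example is treated as a negative. It assumes a binary feature and uses the G-statistic, one common form of generalised likelihood ratio test for contingency tables. The function names `pu_contingency_table` and `g_test_2x2` are hypothetical and do not come from the paper, and the sketch does not cover the power analysis correction of contribution (2).

```python
import math


def pu_contingency_table(features, labelled_positive):
    """Build a 2x2 table of counts from PU data.

    features[k] is a binary feature value (0 or 1).
    labelled_positive[k] is 1 if example k is a labelled positive and
    0 if it is unlabelled (assumption: unlabelled is treated as negative).
    """
    table = [[0, 0], [0, 0]]
    for x, s in zip(features, labelled_positive):
        table[int(x)][int(s)] += 1
    return table


def g_test_2x2(table):
    """G-test of independence for a 2x2 table.

    Returns (G, p_value), where the p-value uses the chi-square
    distribution with 1 degree of freedom.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed == 0:
                continue  # the term 0 * log(0) is taken to be 0
            expected = row_totals[i] * col_totals[j] / n
            g += 2.0 * observed * math.log(observed / expected)
    # Chi-square survival function for 1 degree of freedom:
    # P(Z^2 > g) = erfc(sqrt(g / 2))
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value


if __name__ == "__main__":
    features = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
    labelled = [1] * 50 + [0] * 50
    g, p = g_test_2x2(pu_contingency_table(features, labelled))
    print(f"G = {g:.3f}, p = {p:.3g}")
```

A small p-value here would lead to rejecting independence between the feature and the surrogate label. Per the abstract, this kind of test remains valid for independence testing, whereas sample size determination requires the additional correction the paper proposes.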

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Konstantinos Sechidis (1)
  • Borja Calvo (2)
  • Gavin Brown (1)
  1. School of Computer Science, University of Manchester, Manchester, UK
  2. Department of Computer Science and Artificial Intelligence, University of the Basque Country, Spain
