Learning from Positive and Unlabeled Examples with Different Data Distributions

  • Xiao-Li Li
  • Bing Liu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)

Abstract

We study the problem of learning from positive and unlabeled examples. Although several techniques exist for dealing with this problem, they all assume that the positive examples in the positive set P and the positive examples in the unlabeled set U are generated from the same distribution. This assumption may be violated in practice. For example, suppose one wants to collect all printer pages from the Web. One can use the printer pages from one site as the set P of positive pages and the product pages from another site as U, and then classify the pages in U into printer pages and non-printer pages. Although printer pages from the two sites have many similarities, they can also be quite different, because different sites often present similar products in different styles and with different focuses. In such cases, existing methods perform poorly. This paper proposes a novel technique, A-EM, to deal with the problem. Experimental results on product page classification demonstrate the effectiveness of the proposed technique.

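For readers unfamiliar with the setting, the sketch below illustrates the generic EM-with-naive-Bayes scheme that PU-learning methods such as A-EM build on: unlabeled documents are first treated as negative, and EM then re-estimates soft positive/negative labels for U. This is a minimal illustration only; the scikit-learn components (CountVectorizer, MultinomialNB) and all function and variable names are assumptions, not the paper's implementation, and plain EM like this does not address the P/U distribution mismatch that A-EM is designed to handle.

```python
# Minimal sketch of EM-based learning from positive (P) and unlabeled (U) text.
# Illustrative only: library choices and names are assumptions, not the
# authors' A-EM algorithm.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def em_pu_classifier(pos_docs, unlabeled_docs, n_iters=10):
    """Train a naive Bayes text classifier from positive and unlabeled docs."""
    vec = CountVectorizer()
    X = vec.fit_transform(list(pos_docs) + list(unlabeled_docs))
    n_pos, n_unl = len(pos_docs), len(unlabeled_docs)
    X_pos, X_unl = X[:n_pos], X[n_pos:]

    # Initialization: treat every unlabeled document as negative.
    y_init = np.array([1] * n_pos + [0] * n_unl)
    clf = MultinomialNB().fit(X, y_init)

    # For EM, each unlabeled document appears twice: once as a weighted
    # positive example and once as a weighted negative example.
    X_em = vstack([X_pos, X_unl, X_unl])
    y_em = np.array([1] * n_pos + [1] * n_unl + [0] * n_unl)

    for _ in range(n_iters):
        # E-step: estimate P(positive | d) for each unlabeled document.
        prob_pos = clf.predict_proba(X_unl)[:, 1]
        # M-step: refit with unlabeled documents counted fractionally as
        # positive/negative according to the estimated probabilities.
        weights = np.concatenate([np.ones(n_pos), prob_pos, 1.0 - prob_pos])
        clf = MultinomialNB().fit(X_em, y_em, sample_weight=weights)

    return vec, clf


# Hypothetical usage mirroring the abstract's example: printer pages from one
# site as P, product pages from another site as U.
# vec, clf = em_pu_classifier(printer_pages, product_pages)
# is_printer = clf.predict(vec.transform(product_pages))
```

The key design point the paper addresses is that when P and U come from different sites, the implicit same-distribution assumption behind this kind of EM iteration breaks down, which is why a plain scheme like the one above performs poorly in that setting.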

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Xiao-Li Li (1)
  • Bing Liu (2)
  1. Institute for Infocomm Research, Singapore
  2. Department of Computer Science, University of Illinois at Chicago, Chicago