Learning to Filter Junk E-Mail from Positive and Unlabeled Examples

  • Karl-Michael Schneider
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3248)

Abstract

We study the applicability of partially supervised text classification to junk mail filtering, where a given set of junk messages serves as positive examples and the messages received by a user serve as unlabeled examples, with no negative examples available. Supplying a junk mail filter with a large set of junk mails could result in an algorithm that learns to filter junk mail without user intervention, which would significantly improve the usability of an e-mail client. We study several learning algorithms that handle the unlabeled examples in different ways and present experimental results.
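The abstract does not say which algorithms are compared, so the following is only a minimal sketch of one common way to learn from positive and unlabeled examples, not the paper's method. It assumes a multinomial Naive Bayes model that initially treats every unlabeled message as legitimate and then refines those guesses with a few EM iterations. The names `train_nb`, `prob_junk` and `pu_em`, the tokenized input format, and the iteration count are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab):
    """Fit a two-class multinomial Naive Bayes model from soft labels.

    weights[i] is the assumed probability that docs[i] is junk (class 1).
    Returns (log_prior, log_like), each indexed by class 0 (legitimate) or 1 (junk).
    """
    class_mass = [0.0, 0.0]
    word_mass = [Counter(), Counter()]
    for tokens, w in zip(docs, weights):
        for c, p in ((0, 1.0 - w), (1, w)):
            class_mass[c] += p
            for t in tokens:
                word_mass[c][t] += p
    total = sum(class_mass)
    # Laplace smoothing on both the class priors and the word probabilities.
    log_prior = [math.log((class_mass[c] + 1.0) / (total + 2.0)) for c in (0, 1)]
    log_like = []
    for c in (0, 1):
        denom = sum(word_mass[c].values()) + len(vocab)
        log_like.append({t: math.log((word_mass[c][t] + 1.0) / denom) for t in vocab})
    return log_prior, log_like


def prob_junk(tokens, model):
    """Posterior probability that a tokenized message is junk."""
    log_prior, log_like = model
    scores = [
        log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in (0, 1)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[1] / (exps[0] + exps[1])


def pu_em(positive, unlabeled, iterations=5):
    """Learn a junk filter from known junk (positive) and unlabeled messages."""
    docs = positive + unlabeled
    vocab = {t for d in docs for t in d}
    # Assumption: start by treating every unlabeled message as legitimate.
    weights = [1.0] * len(positive) + [0.0] * len(unlabeled)
    model = train_nb(docs, weights, vocab)
    for _ in range(iterations):
        # E-step: re-estimate junk probabilities for the unlabeled messages only;
        # the known junk messages keep their label.
        weights = [1.0] * len(positive) + [prob_junk(d, model) for d in unlabeled]
        # M-step: refit the model on the updated soft labels.
        model = train_nb(docs, weights, vocab)
    return model
```

A caller would pass each message as a list of tokens, for example `model = pu_em(junk_corpus, inbox)`, and then score a new message with `prob_junk(tokens, model)`. Keeping the positive labels fixed while only the unlabeled messages receive soft labels reflects the setting in the abstract, where junk examples are given and the user's own mail is never labeled.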

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Karl-Michael Schneider, Department of General Linguistics, University of Passau