Online active multi-field learning for efficient email spam filtering

  • Regular Paper
Knowledge and Information Systems

Abstract

Email spam causes a serious waste of time and resources. This paper addresses the email spam filtering problem and proposes an online active multi-field learning approach based on three ideas: (1) email spam filtering is an online application, which suggests online learning; (2) an email document has a multi-field text structure, which suggests multi-field learning; and (3) obtaining labels for a real-world email spam filter is costly, which suggests active learning. The online learner treats email spam filtering as incremental supervised binary classification of streaming text. The multi-field learner combines the results predicted by multiple field classifiers using a novel compound weight schema; each field classifier computes the arithmetic mean of multiple conditional probabilities, which are calculated from feature strings using a string-frequency index data structure. By comparing the current variance of the field classification results with the historical variance, the active learner evaluates classification confidence and treats more uncertain emails as more informative samples for which to request labels. The experimental results show that the proposed approach achieves state-of-the-art performance with greatly reduced label requirements and very low space-time costs. Measured by the standard (1-ROCA)% metric, the performance of our online active multi-field learning even exceeds the full-feedback performance of some advanced individual text classification algorithms.
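
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature strings, the smoothing, the compound weight schema, or the exact variance test, so the character 4-grams, Laplace smoothing, unweighted mean over fields, and the "current variance at or above the running mean of past variances" rule used here are all assumptions, and every class and function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def ngrams(text, n=4):
    """Overlapping character n-grams (assumed feature strings)."""
    if not text:
        return []
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


class FieldClassifier:
    """Classifier for one email field, backed by a string-frequency index."""

    def __init__(self):
        # feature string -> [spam count, ham count]
        self.index = defaultdict(lambda: [0, 0])

    def predict(self, text):
        """Arithmetic mean of per-feature spam probabilities."""
        probs = []
        for feature in ngrams(text):
            spam, ham = self.index.get(feature, (0, 0))
            # Laplace smoothing is an assumption; unseen features give 0.5.
            probs.append((spam + 1) / (spam + ham + 2))
        return mean(probs) if probs else 0.5

    def update(self, text, is_spam):
        """Incrementally add one labeled field text to the index."""
        for feature in ngrams(text):
            self.index[feature][0 if is_spam else 1] += 1


class ActiveMultiFieldFilter:
    """Online filter that requests labels only for uncertain emails."""

    def __init__(self, fields):
        self.classifiers = {name: FieldClassifier() for name in fields}
        self.variance_sum = 0.0  # running sum of past field-score variances
        self.seen = 0

    def score(self, email):
        scores = [clf.predict(email.get(name, ""))
                  for name, clf in self.classifiers.items()]
        # Unweighted mean stands in for the paper's compound weight schema.
        return mean(scores), pvariance(scores)

    def process(self, email, oracle):
        """Score one email; call oracle(email) only if a label is requested."""
        spam_score, variance = self.score(email)
        historical = self.variance_sum / self.seen if self.seen else 0.0
        # Assumed rule: field disagreement at or above the historical
        # average marks the email as uncertain, hence informative.
        ask = variance >= historical
        self.variance_sum += variance
        self.seen += 1
        if ask:
            is_spam = oracle(email)
            for name, clf in self.classifiers.items():
                clf.update(email.get(name, ""), is_spam)
        return spam_score, ask
```

In this sketch an email is a plain dictionary such as `{"subject": "...", "body": "..."}`, and `oracle` is any callable that returns `True` for spam. Keeping one index per field means a disagreement between, say, the subject and body classifiers shows up directly as variance, which is the signal the active learner uses to decide whether a label is worth requesting.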

Author information

Corresponding author

Correspondence to Wuying Liu.

About this article

Cite this article

Liu, W., Wang, T. Online active multi-field learning for efficient email spam filtering. Knowl Inf Syst 33, 117–136 (2012). https://doi.org/10.1007/s10115-011-0461-x

  • DOI: https://doi.org/10.1007/s10115-011-0461-x
