Online active multi-field learning for efficient email spam filtering

Liu, Wuying; Wang, Ting

doi:10.1007/s10115-011-0461-x

Regular Paper
Published: 30 November 2011

Knowledge and Information Systems, Volume 33, pages 117–136 (2012)

Wuying Liu¹,² & Ting Wang¹

Abstract

Email spam causes a serious waste of time and resources. This paper addresses the email spam filtering problem and proposes an online active multi-field learning approach, which is based on the following ideas: (1) email spam filtering is an online application, which suggests online learning; (2) an email document has a multi-field text structure, which suggests multi-field learning; and (3) obtaining a label is costly for a real-world email spam filter, which suggests active learning. The online learner treats email spam filtering as incremental, supervised, binary classification of streaming text. The multi-field learner combines the results predicted by multiple field classifiers using a novel compound weighting scheme, and each field classifier computes the arithmetic mean of multiple conditional probabilities, which are calculated from feature strings using a string-frequency index data structure. By comparing the current variance of the field classification results with the historical variance, the active learner evaluates the classification confidence and treats a more uncertain email as a more informative sample for which to request a label. The experimental results show that the proposed approach achieves state-of-the-art performance with greatly reduced label requirements and very low space-time costs. Measured by the standard (1-ROCA)% metric, the performance of our online active multi-field learning even exceeds the full-feedback performance of some advanced individual text classification algorithms.
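The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the feature strings, the compound weighting scheme, or the exact variance test, so the sketch assumes character 4-grams as features, an unweighted mean across fields, and a label request whenever the current variance is at least the running mean of past variances. All class and method names (`FieldClassifier`, `OnlineActiveFilter`, `process`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


class FieldClassifier:
    """Scores one email field using a string-frequency index."""

    def __init__(self, n=4):
        self.n = n
        # feature string -> [spam count, ham count]
        self.index = defaultdict(lambda: [0, 0])

    def features(self, text):
        # Overlapping character n-grams (assumed feature type).
        if not text:
            return set()
        last = max(1, len(text) - self.n + 1)
        return {text[i:i + self.n] for i in range(last)}

    def score(self, text):
        # Arithmetic mean of P(spam | feature) over features already indexed.
        probs = []
        for feature in self.features(text):
            counts = self.index.get(feature)
            if counts:
                probs.append(counts[0] / (counts[0] + counts[1]))
        return mean(probs) if probs else 0.5  # 0.5 means no evidence yet

    def train(self, text, is_spam):
        for feature in self.features(text):
            self.index[feature][0 if is_spam else 1] += 1


class OnlineActiveFilter:
    """Combines field classifiers and decides when to request a label."""

    def __init__(self, fields):
        self.classifiers = {name: FieldClassifier() for name in fields}
        self.variance_sum = 0.0  # running sum of past variances
        self.seen = 0            # number of emails processed so far

    def predict(self, email):
        scores = [clf.score(email.get(name, ""))
                  for name, clf in self.classifiers.items()]
        # Unweighted mean stands in for the paper's compound weighting.
        return mean(scores), pvariance(scores)

    def process(self, email, oracle):
        score, variance = self.predict(email)
        historical = self.variance_sum / self.seen if self.seen else 0.0
        # Assumed rule: fields that disagree at least as much as usual
        # mark the email as uncertain, so a label is requested.
        ask = variance >= historical
        self.variance_sum += variance
        self.seen += 1
        if ask:
            is_spam = oracle(email)  # costly human label
            for name, clf in self.classifiers.items():
                clf.train(email.get(name, ""), is_spam)
        return score, ask


if __name__ == "__main__":
    spam_filter = OnlineActiveFilter(["subject", "body"])
    email = {"subject": "cheap pills", "body": "buy cheap pills now"}
    print(spam_filter.process(email, oracle=lambda e: True))
    print(spam_filter.predict(email))
```

In this sketch the filter only pays for a label when the field classifiers disagree unusually strongly, which is one plausible reading of how the variance comparison reduces label requirements. While all variances are still zero, the rule as written requests every label.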

Author information

Authors and Affiliations

1. College of Computer, National University of Defense Technology, 410073, Changsha, Hunan, China (Wuying Liu, Ting Wang)
2. Department of Language Engineering, PLA University of Foreign Languages, 471003, Luoyang, Henan, China (Wuying Liu)

Corresponding author

Correspondence to Wuying Liu.

About this article

Cite this article

Liu, W., Wang, T. Online active multi-field learning for efficient email spam filtering. Knowl Inf Syst 33, 117–136 (2012). https://doi.org/10.1007/s10115-011-0461-x

Received: 05 August 2010
Revised: 10 October 2011
Accepted: 15 November 2011
Published: 30 November 2011
Issue Date: October 2012
DOI: https://doi.org/10.1007/s10115-011-0461-x
