Abstract
Email spam causes a serious waste of time and resources. This paper addresses the email spam filtering problem and proposes an online active multi-field learning approach based on three ideas: (1) email spam filtering is an online application, which suggests online learning; (2) an email document has a multi-field text structure, which suggests multi-field learning; and (3) obtaining a label for a real-world email spam filter is costly, which suggests active learning. The online learner treats email spam filtering as an incremental, supervised, binary classification task over a text stream. The multi-field learner combines the results predicted by the field classifiers using a novel compound weight schema; each field classifier computes the arithmetic average of multiple conditional probabilities, which are calculated from feature strings by means of a string-frequency index data structure. The active learner evaluates classification confidence by comparing the current variance of the field classification results with the historical variance, and it treats a more uncertain email as a more informative sample for which to request a label. The experimental results show that the proposed approach achieves state-of-the-art performance with greatly reduced label requirements and very low space-time costs. Measured by the standard (1-ROCA)% metric, the performance of our online active multi-field learning even exceeds the full-feedback performance of some advanced individual text classification algorithms.
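The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify most of the details, so the following are assumptions made only for illustration: feature strings are overlapping character 4-grams, the string-frequency index maps each feature string to a pair of ham and spam counts, field results are combined by a plain mean (the paper's compound weight schema is not reproduced), and a label is requested whenever the variance across field results is at least the running mean of past variances, on the assumption that higher disagreement means lower confidence. All class and parameter names are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


class FieldClassifier:
    """Scores one email field from per-string ham/spam counts (illustrative)."""

    def __init__(self, ngram=4):
        self.ngram = ngram  # assumed feature length, not taken from the paper
        # String-frequency index: feature string -> [ham_count, spam_count]
        self.index = defaultdict(lambda: [0, 0])

    def features(self, text):
        """Return the set of overlapping character n-grams of the field text."""
        n = self.ngram
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def predict(self, text):
        """Arithmetic average of P(spam | feature) over previously seen features."""
        probs = []
        for feature in self.features(text):
            ham, spam = self.index.get(feature, (0, 0))
            if ham + spam:
                probs.append(spam / (ham + spam))
        # With no evidence at all, fall back to a neutral score.
        return sum(probs) / len(probs) if probs else 0.5

    def update(self, text, is_spam):
        """Incrementally add one labeled field text to the index."""
        for feature in self.features(text):
            self.index[feature][1 if is_spam else 0] += 1


class ActiveMultiFieldFilter:
    """Online filter that requests labels only for uncertain emails (illustrative)."""

    def __init__(self, fields):
        self.classifiers = {name: FieldClassifier() for name in fields}
        self.variance_sum = 0.0  # running sum of past field-score variances
        self.seen = 0            # number of emails processed so far

    def score(self, email):
        """Return (combined spam score, variance of the field scores)."""
        scores = [clf.predict(email.get(name, ""))
                  for name, clf in self.classifiers.items()]
        # Plain mean as a stand-in for the paper's compound weight schema.
        return sum(scores) / len(scores), pvariance(scores)

    def process(self, email, oracle):
        """Classify one email; query the oracle only when the fields disagree enough."""
        spam_score, variance = self.score(email)
        historical = self.variance_sum / self.seen if self.seen else 0.0
        ask = variance >= historical  # assumed uncertainty criterion
        self.variance_sum += variance
        self.seen += 1
        if ask:
            label = oracle(email)  # oracle returns True for spam
            for name, clf in self.classifiers.items():
                clf.update(email.get(name, ""), label)
        return spam_score, ask


spam_filter = ActiveMultiFieldFilter(["subject", "body"])
message = {"subject": "cheap pills", "body": "buy cheap pills now"}
first_score, first_asked = spam_filter.process(message, lambda email: True)
```

In this sketch the filter queries every email for as long as the historical variance is zero, which acts as a warm-up phase; a practical filter would likely need a more careful threshold than the running mean used here.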
Cite this article
Liu, W., Wang, T. Online active multi-field learning for efficient email spam filtering. Knowl Inf Syst 33, 117–136 (2012). https://doi.org/10.1007/s10115-011-0461-x
DOI: https://doi.org/10.1007/s10115-011-0461-x