Abstract
Email spam is one of the biggest threats to today’s Internet. Long-established countermeasures such as supervised anti-spam filters exist to deal with it. In this paper, we report the development and evaluation of sentinel, an anti-spam filter based on natural language and stylometry attributes. The performance of the filter is evaluated not only on non-personalized emails (i.e., emails collected randomly) but also on personalized emails (i.e., emails collected from particular individuals). The non-personalized datasets include CSDMC2010, SpamAssassin, and LingSpam, while the Enron-Spam collection comprises personalized emails. The proposed filter extracts natural language attributes from email text that are closely related to writer stylometry and generates classifiers using multiple learning algorithms. Experimental results show that classifiers generated by meta-learning algorithms such as AdaBoostM1 and Bagging perform best: they perform equally well and surpass a number of filters proposed in previous studies, while a classifier generated by random forest is a close second. On the other hand, the performance of classifiers using support vector machine and Naïve Bayes is not satisfactory. In addition, we find much improved results on personalized emails and mixed results on non-personalized emails.
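The pipeline summarized above (extract style-related attributes from the email text, then train an ensemble classifier on them) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four attributes are simple stand-ins rather than the attribute set used by the filter, the helper names `extract_features`, `train_stump`, and `train_bagging` are hypothetical, and the bagging of one-level decision stumps is a toy substitute for the learning algorithms the paper evaluates.

```python
import random
import string
from collections import Counter


def extract_features(text):
    """Map an email body to a few illustrative style attributes.

    These attributes are assumed stand-ins, not the paper's attribute set.
    """
    words = text.split()
    n_chars = max(len(text), 1)   # avoid division by zero on empty text
    n_words = max(len(words), 1)
    return [
        len(words),                                            # word count
        sum(len(w) for w in words) / n_words,                  # average word length
        sum(c.isupper() for c in text) / n_chars,              # uppercase ratio
        sum(c in string.punctuation for c in text) / n_chars,  # punctuation ratio
    ]


def train_stump(X, y):
    """Pick the single-feature threshold rule with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for spam_above in (True, False):
                preds = [int((row[f] > t) == spam_above) for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, spam_above)
    _, f, t, spam_above = best
    return lambda row: int((row[f] > t) == spam_above)


def train_bagging(X, y, n_models=15, seed=0):
    """Bagging: train each stump on a bootstrap resample, then majority-vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in X]  # sample with replacement
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))

    def predict(row):
        votes = Counter(model(row) for model in models)
        return votes.most_common(1)[0][0]

    return predict
```

To use the sketch, one would build `X` by calling `extract_features` on each labelled email body, pass `X` with labels `y` (1 for spam, 0 for ham) to `train_bagging`, and apply the returned `predict` function to the features of an unseen email. An odd `n_models` avoids tied votes in the two-class case.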
Notes
Most of the public email datasets are imbalanced [19].
Downloadable at http://nlp.stanford.edu/software/tagger.shtml.
Available at: http://www.languagetool.org/java-api/.
Downloadable at http://jsoup.org/download.
Downloadable at http://spamassassin.apache.org/publiccorpus/.
Downloadable at http://csmining.org/index.php/spam-email-datasets-.html.
Downloadable at http://csmining.org/index.php/ling-spam-datasets.html.
See http://www.projecthoneypot.org.
Overview at http://untroubled.org/spam.
References
Abi-Haidar A, Rocha LM (2008a) Adaptive spam detection inspired by a cross-regulation model of immune dynamics: a study of concept drift. In: Artificial immune systems. Springer, Berlin, pp 36–47
Abi-Haidar A, Rocha LM (2008b) Adaptive spam detection inspired by the immune system. In: ALIFE, pp 1–8
Afroz S, Brennan M, Greenstadt R (2012) Detecting hoaxes, frauds, and deception in writing style online. In: 2012 IEEE symposium on security and privacy (SP), pp 461–475
Androutsopoulos I, Koutsias J, Chandrinos KV, Spyropoulos CD (2000) An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: 23rd Annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 160–167
Bickel S (2006) ECML-PKDD discovery challenge 2006 overview. In: Proceedings of the ECML/PKDD discovery challenge workshop, pp 1–9
Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell 29(1):63–92
Bratko A, Cormack GV, R D, Filipic B, Chan P, Lynam TR (2006) Spam filtering using statistical data compression models. J Mach Learn Res 7:2673–2698
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Carreras X, Màrquez L (2001) Boosting trees for anti-spam email filtering. In: RANLP-2001, 4th International conference on recent advances in natural language processing, pp 58–64
Cheng V, Li C (2007) Combining supervised and semi-supervised classifier for personalized spam filtering. In: Proceedings of the 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 449–456. doi:10.1007/978-3-540-71701-0_45
Cheng V, Li CH (2006) Personalized spam filtering with semi-supervised classifier ensemble. In: 2006 IEEE/WIC/ACM international conference on web intelligence (WI 2006), pp 195–201. doi:10.1109/WI.2006.132
Commtouch (2013) Internet threats trend report. Technical report, Commtouch, USA. http://www.commtouch.com/uploads/2013/04/Commtouch-Internet-Threats-Trend-Report-2013-April.pdf
Cormack GV (2007) TREC 2007 spam track overview. In: Proceedings of the sixteenth text retrieval conference, TREC 2007. http://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf
Cormack GV, Bratko A (2006) Batch and online spam filter comparison. In: Conference on email and anti-spam, CEAS 2006, Mountain View, CA
Cormack GV, Lynam TR (2005) TREC 2005 spam track overview. In: Proceedings of the fourteenth text retrieval conference, TREC 2005. http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf
Drummond C, Holte R (2006) Cost curves: an improved method for visualizing classifier performance. Mach Learn 65(1):95–130
Goodman J, Cormack GV, Heckerman D (2007) Spam and the ongoing battle for the inbox. Commun ACM 50(2):24–33
Graham P (2003) A plan for spam. http://paulgraham.com/spam.html
Guzella TS, Caminhas WM (2009) A review of machine learning approaches to spam filtering. Expert Syst Appl 36(7):10206–10222
Haider P, Brefeld U, Scheffer T (2007) Supervised clustering of streaming data for email batch detection. In: 24th International conference on machine learning. ACM, pp 345–352
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer Series in Statistics. Springer, Berlin
Holte RC, Drummond C (2008) Cost-sensitive classifier evaluation using cost curves. Lecture Notes in Computer Science. In: Washio T, Suzuki E, Ting KM, Inokuchi A (eds) Pacific-Asia conference on knowledge discovery and data mining (PAKDD), vol 5012. Springer, Berlin, pp 26–29
Hu Y, Guo C, Ngai EWT, Liu M, Chen S (2010) A scalable intelligent non-content-based spam-filtering framework. Expert Syst Appl 37(12):8557–8565
Iqbal F, Khan LA, Fung BCM, Debbabi M (2010) E-mail authorship verification for forensic investigation. In: Proceedings of the 2010 ACM symposium on applied computing, ACM, New York, NY, SAC ’10, pp 1591–1598
Issac B, Jap WJ, Sutanto JH (2009) Improved Bayesian anti-spam filter implementation and analysis on independent spam corpuses. In: 2009 International conference on computer engineering and technology, vol 02. IEEE Computer Society, pp 326–330
Kosmopoulos A, Paliouras G, Androutsopoulos I (2008) Adaptive spam filtering using only naive Bayes text classifiers. In: Fifth conference on email and anti-spam (CEAS 2008)
Kursa MB, Rudnicki WR (2010) Feature selection with the Boruta package. J Stat Softw 36(11):1–13. http://www.jstatsoft.org/v36/i11/
Lai CC, Tsai MC (2004) An empirical performance comparison of machine learning methods for spam e-mail categorization. In: Fourth international conference on hybrid intelligent systems. IEEE Computer Society, HIS ’04, pp 44–48
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/
Ma Q, Qin Z, Zhang F, Liu Q (2010) Text spam neural network classification algorithm. In: 2010 International conference on communications, circuits and systems (ICCCAS), pp 466–469
Meng Y, Li W, Kwok L (2014) Enhancing email classification using data reduction and disagreement-based semi-supervised learning. In: IEEE international conference on communications, ICC 2014, Sydney, Australia, pp 622–627. doi:10.1109/ICC.2014.6883388
Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with naive Bayes—Which naive Bayes? In: Third conference on email and anti-spam (CEAS)
Mojdeh M, Cormack GV (2008) Semi-supervised spam filtering: does it work? In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 2008, pp 745–746. doi:10.1145/1390334.1390482
Orăsan C, Krishnamurthy R (2002) A corpus-based investigation of junk emails. In: Third international conference on language resources and evaluation (LREC-2002), Spain, pp 1773–1780
Prabhakar R, Basavaraju M (2010) A novel method of spam mail detection using text based clustering approach. Int J Comput Appl 5(4):15–25. Published by Foundation of Computer Science
Qaroush A, Khater IM, Washaha M (2012) Identifying spam e-mail based-on statistical header features and sender behavior. In: CUBE international information technology conference. ACM, pp 771–778
Razmara M, Razmara A, Narouei M (2012) Textual spam detection: an iterative pattern mining approach. World Appl Sci J 20(2):198–204
Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In: Learning for text categorization: papers from the 1998 workshop, AAAI Technical Report WS-98-05, pp 55–62
Schapire RE (1999) A brief introduction to boosting. In: 16th International joint conference on artificial intelligence, vol 2, Morgan Kaufmann Publishers Inc., Los Altos, CA, IJCAI’99, pp 1401–1406
Shams R, Mercer RE (2013) Classifying spam emails using text and readability features. In: 2013 IEEE 13th international conference on data mining, pp 657–666. doi:10.1109/ICDM.2013.131
Shen X, Tseng GC, Zhang X, Wong WH (2003) On psi-learning. J Am Stat Assoc 98:724–734. http://EconPapers.repec.org/RePEc:bes:jnlasa:v:98:y:2003:p:724-734
Sheu JJ (2009) An efficient two-phase spam filtering method based on e-mails categorization. Int J Netw Secur 9(1):34–43
Sirisanyalak B, Sornil O (2007) Artificial immunity-based feature extraction for spam detection. In: Software engineering, artificial intelligence, networking, and parallel/distributed computing. SNPD 2007. Eighth ACIS international conference on, vol 3, pp 359–364
Vapnik V (1998) Statistical learning theory. Wiley, New York
Wang J, Shen X (2007) Large margin semi-supervised learning. J Mach Learn Res 8:1867–1891. http://dl.acm.org/citation.cfm?id=1314561
Xu JM, Fumera G, Roli F, Zhou ZH (2009) Training SpamAssassin with active semi-supervised learning. In: Sixth conference on email and anti-spam
Yang J, Liu Y, Liu Z, Zhu X, Zhang X (2011) A new feature selection algorithm based on binomial hypothesis testing for spam filtering. Knowl Based Syst 24(6):904–914
Ye M, Tao T, Mai FJ, Cheng XH (2008) A spam discrimination based on mail header feature and SVM. In: Fourth international conference on wireless communications, networking and mobile computing (WiCom08), pp 1–4
Zhan J, Oommen BJ, Crisostomo J (2011) Anomaly detection in dynamic systems using weak estimators. ACM Trans Internet Technol 11(1):3:1–3:16
Zhu Y, Tan Y (2011) A local-concentration-based feature extraction approach for spam filtering. IEEE Trans Inf Forensics Secur 6(2):486–497
Acknowledgments
Support for this work was provided through a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to Robert E. Mercer (Grant No. 36853–2010 RGPIN). We are indebted to Vangelis Metsis, Aris Kosmopoulos, and Robert Holte for their correspondence regarding the use of their term frequency attribute and Cost Curve Tool.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Shams, R., Mercer, R.E. Supervised classification of spam emails with natural language stylometry. Neural Comput & Applic 27, 2315–2331 (2016). https://doi.org/10.1007/s00521-015-2069-7
DOI: https://doi.org/10.1007/s00521-015-2069-7