Digital Waste Sorting: A Goal-Based, Self-Learning Approach to Label Spam Email Campaigns

Sheikhalishahi, Mina; Saracino, Andrea; Mejri, Mohamed; Tawbi, Nadia; Martinelli, Fabio

doi:10.1007/978-3-319-24858-5_1

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 9331)

Included in the following conference series:

International Workshop on Security and Trust Management

Abstract

Fast analysis of correlated spam emails may be vital for finding and prosecuting spammers who commit cybercrimes such as phishing and online fraud. This paper presents a self-learning framework that automatically divides and classifies large amounts of spam emails into correlated, labeled groups. Building on large datasets collected daily through honeypots, the emails are first divided into homogeneous groups of similar messages (campaigns), each of which can be related to a specific spammer. Each campaign is then associated with a class that specifies the goal of the spammer, e.g. phishing or advertisement. The proposed framework uses a categorical clustering algorithm to group similar emails and a classifier to subsequently label each email group. Its main advantage is that it can be applied to large spam email datasets for which no prior knowledge is provided. The approach has been tested on more than 3200 real and recent spam emails, divided into more than 60 campaigns, reporting a classification accuracy of 97% on the classified data.
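The two-stage idea described in the abstract (group similar emails into campaigns, then label each campaign with a goal) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the feature names, the greedy mismatch-threshold grouping, the nearest-profile labeling, and all sample values are assumptions chosen only to make the pipeline concrete.

```python
from collections import Counter

# Hypothetical categorical features per email (not taken from the paper):
# (language, attachment_type, link_count_bucket, subject_type)

def mismatches(a, b):
    """Count positions where two categorical feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_campaigns(emails, max_mismatch=1):
    """Greedy grouping (an assumed stand-in for categorical clustering):
    an email joins the first campaign whose first member differs from it
    in at most max_mismatch features; otherwise it starts a new campaign."""
    campaigns = []
    for email in emails:
        for campaign in campaigns:
            if mismatches(email, campaign[0]) <= max_mismatch:
                campaign.append(email)
                break
        else:
            campaigns.append([email])
    return campaigns

def campaign_profile(campaign):
    """Most common value of each feature across the campaign."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*campaign))

def label_campaign(campaign, labeled_profiles):
    """Nearest-profile classifier (assumed): choose the label whose
    reference profile has the fewest mismatches with the campaign profile."""
    profile = campaign_profile(campaign)
    return min(labeled_profiles,
               key=lambda label: mismatches(profile, labeled_profiles[label]))

# Made-up sample data for illustration only.
emails = [
    ("en", "no_attachment", "one_link", "account_alert"),
    ("en", "no_attachment", "one_link", "account_alert"),
    ("en", "no_attachment", "many_links", "account_alert"),
    ("en", "no_attachment", "many_links", "discount"),
    ("en", "image", "many_links", "discount"),
]
labeled_profiles = {
    "phishing": ("en", "no_attachment", "one_link", "account_alert"),
    "advertisement": ("en", "image", "many_links", "discount"),
}

campaigns = cluster_campaigns(emails)
labels = [label_campaign(c, labeled_profiles) for c in campaigns]
for campaign, label in zip(campaigns, labels):
    print(len(campaign), "emails ->", label)
```

The point of labeling at the campaign level rather than per email is that one decision covers every message in the group, which is what makes the approach attractive for large unlabeled datasets. A real system would replace both the grouping rule and the classifier with properly evaluated methods.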

This research has been partially supported by EU Seventh Framework Programme (FP7/2007–2013) under grant no 610853 (COCO Cloud), MIUR-PRIN Security Horizons and Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Authors and Affiliations

Department of Computer Science, Université Laval, Québec City, Canada
Mina Sheikhalishahi, Mohamed Mejri & Nadia Tawbi
Istituto di Informatica e Telematica, Consiglio Nazionale delle ricerche, Pisa, Italy
Andrea Saracino & Fabio Martinelli

Corresponding author

Correspondence to Mina Sheikhalishahi.

Editor information

Editors and Affiliations

Dipartimento di Informatica, Università degli Studi di Milano, Crema, Italy
Sara Foresti

About this paper

Cite this paper

Sheikhalishahi, M., Saracino, A., Mejri, M., Tawbi, N., Martinelli, F. (2015). Digital Waste Sorting: A Goal-Based, Self-Learning Approach to Label Spam Email Campaigns. In: Foresti, S. (eds) Security and Trust Management. STM 2015. Lecture Notes in Computer Science, vol 9331. Springer, Cham. https://doi.org/10.1007/978-3-319-24858-5_1

DOI: https://doi.org/10.1007/978-3-319-24858-5_1
Published: 09 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24857-8
Online ISBN: 978-3-319-24858-5
eBook Packages: Computer Science, Computer Science (R0)
