Spam Email Filtering Using Network-Level Properties

  • Paulo Cortez
  • André Correia
  • Pedro Sousa
  • Miguel Rocha
  • Miguel Rio
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6171)

Abstract

Spam is a serious problem that affects email users (e.g. phishing attacks, viruses and time spent reading unwanted messages). We propose a novel spam email filtering approach based on network-level attributes (e.g. the geographic coordinates of the sender's IP address) that are more persistent over time than message content. This approach was tested using two classifiers, Naive Bayes (NB) and Support Vector Machines (SVM), and compared against bag-of-words models and eight blacklists. Several experiments were conducted with recently collected legitimate (ham) and non-legitimate (spam) messages in order to simulate distinct user profiles from two countries (USA and Portugal). Overall, the network-level SVM model achieved the best discriminatory performance. Moreover, preliminary results suggest that this method is more robust to phishing attacks.
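
As a rough illustration of the general idea of classifying messages from network-level attributes rather than content, the following minimal sketch fits a Gaussian Naive Bayes model to sender coordinates. It is not the authors' implementation: the Gaussian variant, the function names (`fit_gaussian_nb`, `predict`) and the six (latitude, longitude) training pairs are all assumptions invented for this example, and the abstract does not state which attributes or Naive Bayes variant the paper actually uses.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small floor keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one message."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=lambda label: log_posterior(*model[label]))


# Synthetic (latitude, longitude) pairs, invented for illustration only.
train_rows = [(38.7, -9.1), (41.5, -8.4), (40.2, -8.4),
              (55.7, 37.6), (39.9, 116.4), (-23.5, -46.6)]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

model = fit_gaussian_nb(train_rows, train_labels)
print(predict(model, (41.1, -8.6)))
```

In this toy setup the ham senders cluster tightly in one region while the spam senders are spread widely, so a sender near the ham cluster scores higher under the ham class. Real user profiles would need many more messages and attributes than this sketch assumes.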

Keywords

Anti-Spam filtering, Text Mining, Naive Bayes, Support Vector Machines

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Paulo Cortez
    • 1
  • André Correia
    • 1
  • Pedro Sousa
    • 3
  • Miguel Rocha
    • 3
  • Miguel Rio
    • 2
  1. Dep. of Information Systems/Algoritmi, University of Minho, Guimarães, Portugal
  2. Dep. of Informatics, University of Minho, Braga, Portugal
  3. Department of Electronic and Electrical Engineering, University College London, London, UK