Spam Detection Using Character N-Grams

  • Ioannis Kanaris
  • Konstantinos Kanaris
  • Efstathios Stamatatos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3955)


This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse data problem that arises with word-level n-grams. Moreover, it is language-independent and does not require a lemmatizer or any 'deep' text preprocessing. Based on experiments on the Ling-Spam corpus, we evaluate the proposed representation in combination with support vector machines. Both the binary and the term-frequency representations achieve high precision while keeping recall at an equally high level, which is a crucial factor for anti-spam filters, a cost-sensitive application.
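
As an illustration of the 'bag of character n-grams' idea described above, the following minimal Python sketch extracts overlapping character n-grams from a message and builds both a term-frequency and a binary representation. The function names `char_ngrams` and `bag_of_ngrams`, the sample message, and the choice of n = 4 are assumptions made for this example only; they are not taken from the paper, and the sketch does not reproduce the authors' feature selection or classifier setup.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order.

    Hypothetical helper for illustration; no tokenisation or
    lemmatisation is applied, only raw characters are used.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_ngrams(text, n=3, binary=False):
    """Map each distinct n-gram to its count (term frequency),
    or to 1 if `binary` is True (presence only)."""
    counts = Counter(char_ngrams(text, n))
    if binary:
        return {gram: 1 for gram in counts}
    return dict(counts)


# Illustrative message; n = 4 is an arbitrary choice for the example.
message = "free free offer"
tf = bag_of_ngrams(message, n=4)
binary = bag_of_ngrams(message, n=4, binary=True)

print(tf["free"])      # frequency of the 4-gram "free"
print(binary["free"])  # presence flag of the same 4-gram
```

Dictionaries of this kind could then be mapped onto a fixed vocabulary of n-grams to obtain the feature vectors passed to a classifier such as a support vector machine; that step is omitted here.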


Support Vector Machine, Suffix Tree, Binary Attribute, Spam Detection, Spam Message
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Ioannis Kanaris (1)
  • Konstantinos Kanaris (2)
  • Efstathios Stamatatos (1)
  1. Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Greece
  2. Dept. of Mathematics, University of the Aegean, Karlovassi, Greece
