Analyzing the Performance of Spam Filtering Methods When Dimensionality of Input Vector Changes

  • J. R. Méndez
  • B. Corzo
  • D. Glez-Peña
  • F. Fdez-Riverola
  • F. Díaz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4571)

Abstract

Spam is a complex problem that hinders the exploitation of Internet resources. Several authorities have warned about the scale of this problem and urged everyone to fight it. In this paper we present an extensive analysis of how changing the dimensionality of the message representation influences the accuracy of some well-known classical spam filtering techniques. The conclusions drawn from the experiments will be useful for comparing the effects of dimensionality changes on classical filtering techniques with their effects on a successful spam filter model called SpamHunting.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • J. R. Méndez (1)
  • B. Corzo (2)
  • D. Glez-Peña (1)
  • F. Fdez-Riverola (1)
  • F. Díaz (3)
  1. Computer Science Dept., University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
  2. Dept. Advertising Graphics, Arts College of Oviedo, C/ Julián Clavería, 12, 33006, Oviedo, Spain
  3. Computer Science Dept., University of Valladolid, Escuela Universitaria de Informática, Plaza Santa Eulalia, 9-11, 40005, Segovia, Spain
