Assessing Classification Accuracy in the Revision Stage of a CBR Spam Filtering System

  • José Ramón Méndez
  • Carlos González
  • Daniel Glez-Peña
  • Florentino Fdez-Riverola
  • Fernando Díaz
  • Juan Manuel Corchado
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4626)

Abstract

In this paper we introduce a quality metric for characterizing the solutions generated by SpamHunting, a successful case-based reasoning (CBR) spam filtering system. The proposed metric, called the relevant information amount rate, combines estimates of the relevance and the amount of information recovered during the retrieve stage of a CBR system. Experimental results show that this measure is a suitable complement to the classifications computed by SpamHunting. To evaluate the performance of the quality estimation index, we designed a formal benchmark procedure that can be used to evaluate any accuracy metric. Finally, following this procedure, we show the behaviour of the proposed measure on two well-known, publicly available corpora.

Keywords

Receiver Operating Characteristic; Feature Selection; Receiver Operating Characteristic Curve; Receiver Operating Characteristic Analysis; Negative Likelihood Ratio
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • José Ramón Méndez (1)
  • Carlos González (2)
  • Daniel Glez-Peña (1)
  • Florentino Fdez-Riverola (1)
  • Fernando Díaz (3)
  • Juan Manuel Corchado (4)

  1. Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004 Ourense, Spain
  2. GFI Informatique, C/ Salvatierra 5, 28034 Madrid, Spain
  3. Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Plaza Santa Eulalia, 9-11, 40005 Segovia, Spain
  4. Dept. Informática y Automática, University of Salamanca, Plaza de la Merced s/n, 37008 Salamanca, Spain