Using Biased Discriminant Analysis for Email Filtering

  • Juan Carlos Gomez
  • Marie-Francine Moens
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6276)

Abstract

This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method, which relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), aims to find a feature space transformation that clusters the positive examples closely while pushing the negative ones away. The method is an extension of Linear Discriminant Analysis (LDA), but it introduces a different transformation to improve the separation between classes, and it has not previously been applied to text mining tasks.
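The abstract does not spell out the transformation, so the following is only a minimal sketch of how a biased discriminant projection can be computed, assuming the formulation commonly given for BDA: maximize the scatter of the negative examples around the positive centroid relative to the scatter of the positive examples around that same centroid. The function name `bda_transform`, the `reg` regularization term, and the whitening-based solver are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bda_transform(X_pos, X_neg, n_components=2, reg=1e-3):
    """Return a (d, n_components) projection matrix W (hypothetical helper).

    Assumed objective: maximize |W^T S_neg W| / |W^T S_pos W|, where both
    scatter matrices are measured around the centroid of the positives.
    """
    m_pos = X_pos.mean(axis=0)
    D_pos = X_pos - m_pos                      # positives around their own centroid
    D_neg = X_neg - m_pos                      # negatives around the positive centroid
    S_pos = D_pos.T @ D_pos / len(X_pos)
    S_neg = D_neg.T @ D_neg / len(X_neg)

    d = X_pos.shape[1]
    S_pos_reg = S_pos + reg * np.eye(d)        # assumption: ridge term keeps S_pos invertible

    # Whiten with respect to S_pos_reg, then diagonalize S_neg in that space.
    evals, evecs = np.linalg.eigh(S_pos_reg)
    whiten = evecs / np.sqrt(evals)            # scales column j by evals[j] ** -0.5
    lam, U = np.linalg.eigh(whiten.T @ S_neg @ whiten)

    top = np.argsort(lam)[::-1][:n_components] # directions with the largest ratio
    return whiten @ U[:, top]
```

Projecting a message vector `x` with `(x - m_pos) @ W` then yields the reduced features. Choosing `n_components` and `reg` is left open here; the abstract only states that results were stable across the number of features chosen.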

We successfully test BDA under two schemas. The first is a traditional classification scenario using 10-fold cross-validation on four standard ground-truth corpora: LingSpam, SpamAssassin, the Phishing corpus, and a subset of the TREC 2007 spam corpus. In the second schema, we test the anticipatory properties of the statistical features with the TREC 2007 spam corpus.
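The two schemas can be pictured as two ways of splitting a corpus. The sketch below is illustrative only: it assumes stratified folds for the cross-validation and, for the anticipatory test, assumes that training uses the earliest messages and testing uses the later ones. The abstract does not describe the exact protocol, and the helper names `stratified_folds` and `chronological_split` are hypothetical.

```python
import numpy as np

def stratified_folds(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation,
    spreading each class as evenly as possible over the folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk.tolist())
    all_idx = np.arange(len(labels))
    for fold in folds:
        test_idx = np.array(sorted(fold), dtype=int)
        train_idx = np.setdiff1d(all_idx, test_idx)
        yield train_idx, test_idx

def chronological_split(timestamps, train_fraction=0.5):
    """Assumed anticipatory setup: train on the earliest messages,
    test on the messages that arrive afterwards."""
    order = np.argsort(np.asarray(timestamps), kind="stable")
    cut = int(len(order) * train_fraction)
    return order[:cut], order[cut:]
```

In the first schema the feature transformation and the classifier would be fitted on each `train_idx` and scored on the matching `test_idx`; in the second, a single model fitted on the early portion is scored on the later portion to see whether the features keep their discriminative value over time.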

The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and yields features that robustly retain their discriminative value over time.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Juan Carlos Gomez (1)
  • Marie-Francine Moens (2)
  1. ITESM, Monterrey, Mexico
  2. Katholieke Universiteit Leuven, Heverlee, Belgium