Online Evaluation of Email Streaming Classifiers Using GNUsmail

  • José M. Carmona-Cejudo
  • Manuel Baena-García
  • José del Campo-Ávila
  • Albert Bifet
  • João Gama
  • Rafael Morales-Bueno
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7014)

Abstract

Real-time email classification is a challenging task because of its online nature, which makes it subject to concept drift. Spam identification, where only two labels exist, has received great attention in the literature. We are instead interested in classification into multiple folders, which adds a further source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluating data streams, so other metrics, such as the prequential error, have been proposed. The prequential error has problems of its own, however, which can be alleviated by mechanisms such as fading factors. In this paper we present GNUsmail, an open-source, extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail’s architecture supports incremental and online learning, and it can be used to compare different online mining methods using state-of-the-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, and present a tool for launching replicable experiments.
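
The abstract does not define the prequential error or the fading factor. As a minimal sketch, the following uses the recursive form common in the stream-learning literature: each example is first used to test the model and then to train it, and older losses are discounted by a factor alpha. This is not GNUsmail's own code; the function name, the value alpha = 0.9, and the synthetic loss sequence are assumptions made for illustration.

```python
from typing import Iterable, List


def prequential_error(losses: Iterable[float], alpha: float = 1.0) -> List[float]:
    """Return the running prequential error after each example.

    alpha = 1.0 gives the plain prequential error (mean of all losses so far).
    0 < alpha < 1 applies a fading factor that down-weights older losses.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    faded_loss_sum = 0.0  # S_i = L_i + alpha * S_{i-1}
    faded_count = 0.0     # N_i = 1 + alpha * N_{i-1}
    errors = []
    for loss in losses:
        faded_loss_sum = loss + alpha * faded_loss_sum
        faded_count = 1.0 + alpha * faded_count
        errors.append(faded_loss_sum / faded_count)
    return errors


# Hypothetical 0-1 losses: the classifier is wrong on the first 50 emails
# and right on the next 50 (for example, after adapting to the stream).
losses = [1.0] * 50 + [0.0] * 50
plain = prequential_error(losses)[-1]             # averages over all 100 emails
faded = prequential_error(losses, alpha=0.9)[-1]  # dominated by recent emails
print(f"plain: {plain:.3f}, faded: {faded:.3f}")
```

In this synthetic run the plain estimate still carries the early mistakes, while the faded estimate tracks recent performance much more closely. That difference is the kind of problem fading factors are meant to alleviate.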

Keywords

Email Classification, Online Methods, Concept Drift, Text Mining

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • José M. Carmona-Cejudo (1)
  • Manuel Baena-García (1)
  • José del Campo-Ávila (1)
  • Albert Bifet (2)
  • João Gama (3)
  • Rafael Morales-Bueno (1)

  1. Universidad de Málaga, Spain
  2. University of Waikato, New Zealand
  3. University of Porto, Portugal
