The Enron Corpus: A New Dataset for Email Classification Research

  • Bryan Klimt
  • Yiming Yang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3201)

Abstract

Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
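As a rough illustration of the setup described above, the sketch below splits a raw message into the four sections (From, To, Subject and body) and combines per-section classifier scores with a weighted sum. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `split_sections` and `combine_scores`, the sample message, and the numeric scores and weights are all hypothetical. In the paper the weights come from regression and the scores from SVM classifiers; here both are placeholder values.

```python
from email import message_from_string

# The four sections named in the abstract.
SECTIONS = ("from", "to", "subject", "body")


def split_sections(raw_message: str) -> dict:
    """Split a raw RFC 822-style message into From, To, Subject and body."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart message: concatenate the payloads of the leaf parts.
        body = "\n".join(
            str(part.get_payload())
            for part in msg.walk()
            if not part.is_multipart()
        )
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "body": body,
    }


def combine_scores(section_scores: dict, weights: dict, bias: float = 0.0) -> float:
    """Linear combination of per-section scores for one candidate folder.

    The weights are placeholders here; the abstract says they are obtained
    by regression, which this sketch does not implement.
    """
    return bias + sum(weights[name] * section_scores[name] for name in SECTIONS)


# Hypothetical sample message (not taken from the corpus).
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 gas contracts\n"
    "\n"
    "Please review the attached draft.\n"
)
sections = split_sections(raw)

# Hypothetical per-section scores for one folder, and hypothetical weights.
scores = {"from": 1.0, "to": 0.0, "subject": 0.5, "body": -0.5}
weights = {"from": 0.4, "to": 0.1, "subject": 0.3, "body": 0.2}
combined = combine_scores(scores, weights)
```

Each section would be scored by its own classifier, and a message would be assigned to the folder with the highest combined score; how the paper thresholds or ranks these scores is not stated in the abstract.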

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Bryan Klimt (1)
  • Yiming Yang (1)
  1. Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA
