A Comparative Study of Classification Based Personal E-mail Filtering
This paper addresses personal E-mail filtering by casting it in the framework of text classification. Modeled as semi-structured documents, E-mail messages consist of a set of fields with predefined semantics and a number of variable-length free-text fields. While most work on classification concentrates on either structured data or free text, the work in this paper deals with both. To perform classification, a naive Bayesian classifier was designed and implemented, and a decision-tree-based classifier was implemented. The design considerations and implementation issues are discussed. Using a relatively large amount of real personal E-mail data, a comprehensive comparative study of the two classifiers was conducted. The importance of different features is reported, and results on other issues related to building an effective personal E-mail classifier are presented and discussed. It is shown that both classifiers can perform filtering with reasonable accuracy. While the decision-tree-based classifier outperforms the Bayesian classifier when features and training size are selected optimally for both, a carefully designed naive Bayesian classifier is more robust.
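To make the idea of classifying semi-structured messages concrete, the following is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing. It is an illustration under stated assumptions, not the classifier designed in the paper: the field names (`from`, `subject`, `body`), the field-prefixed token scheme, the `featurize` helper, the `NaiveBayes` class, and the toy messages are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def featurize(message):
    """Turn a message (dict of field name -> text) into a list of tokens.

    Assumption: each token is prefixed with its field name, so the same
    word in the subject and in the body counts as two distinct features.
    """
    tokens = []
    for field, text in message.items():
        for word in text.lower().split():
            tokens.append(f"{field}:{word}")
    return tokens


class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()             # messages seen per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, messages, labels):
        for message, label in zip(messages, labels):
            self.class_counts[label] += 1
            for token in featurize(message):
                self.token_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, message):
        total_docs = sum(self.class_counts.values())
        # Tokens never seen in training are ignored.
        tokens = [t for t in featurize(message) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = (sum(self.token_counts[label].values())
                     + self.alpha * len(self.vocab))
            for token in tokens:
                count = self.token_counts[label][token]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical, for illustration only).
train = [
    ({"from": "boss@example.com", "subject": "budget meeting",
      "body": "please review the budget"}, "work"),
    ({"from": "boss@example.com", "subject": "project deadline",
      "body": "the report is due friday"}, "work"),
    ({"from": "friend@example.org", "subject": "weekend hiking",
      "body": "are you free saturday"}, "personal"),
    ({"from": "friend@example.org", "subject": "dinner plans",
      "body": "pizza on friday night"}, "personal"),
]
model = NaiveBayes().fit([m for m, _ in train], [y for _, y in train])

new_message = {"from": "boss@example.com", "subject": "budget report",
               "body": "review due"}
print(model.predict(new_message))
```

Prefixing tokens with the field name is one simple way to let a single model use both the structured fields and the free text; other designs, such as separate per-field models, are equally plausible.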