Journal of Intelligent Information Systems, Volume 30, Issue 2, pp 153–181

Mining categories for emails via clustering and pattern discovery

Article

Abstract

The continuous exchange of information through the popular email service has raised the problem of managing, effectively and efficiently, the huge volume of messages that users receive. We address email classification by devising suitable strategies for: (1) organizing messages into homogeneous groups, (2) redirecting new incoming messages according to an initial organization, and (3) building reliable descriptions of the discovered message groups. We propose a unified framework for handling and classifying email messages. In our framework, messages sharing similar features are clustered into a folder organization. Clustering and pattern discovery techniques for mining structured and unstructured information from email messages form the basis of an overall process of folder creation and maintenance and of email redirection. Pattern discovery is also exploited to generate suitable cluster descriptions, which play a leading role in cluster updating. Experimental evaluation on several personal mailboxes shows the effectiveness of our approach.
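The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes bag-of-words vectors built from message text only, cosine similarity, a small k-means-style loop with farthest-first seeding, and frequent single terms as a stand-in for pattern discovery. All function names, the stopword list, the `min_support` parameter, and the toy mailbox are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for the toy example.
STOPWORDS = {"the", "a", "an", "to", "of", "for", "and", "is", "on", "in", "at", "our", "your"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def vectorize(text):
    """Represent a message as a bag-of-words term-frequency vector."""
    return Counter(tokenize(text))

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Sum of member vectors; cosine similarity ignores the missing 1/n scaling."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total

def cluster(vectors, k, iterations=10):
    """Step 1: group messages into k folders (assumed k-means-style loop)."""
    seeds = [vectors[0]]
    while len(seeds) < k:
        # Farthest-first seeding: pick the message least similar to the current seeds.
        seeds.append(min(vectors, key=lambda v: max(cosine(v, s) for s in seeds)))
    centroids, labels = seeds, []
    for _ in range(iterations):
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            updated.append(centroid(members) if members else centroids[j])
        centroids = updated
    return labels, centroids

def redirect(message, centroids):
    """Step 2: send a new message to the folder with the most similar centroid."""
    vec = vectorize(message)
    return max(range(len(centroids)), key=lambda j: cosine(vec, centroids[j]))

def describe(messages, labels, folder, min_support=0.5):
    """Step 3: describe a folder by terms found in at least min_support of its messages."""
    members = [set(tokenize(m)) for m, lab in zip(messages, labels) if lab == folder]
    doc_freq = Counter(t for terms in members for t in terms)
    return sorted(t for t, n in doc_freq.items() if n / len(members) >= min_support)

# Toy mailbox (made-up subject lines, not data from the paper).
mailbox = [
    "Project meeting agenda for Monday",
    "Minutes of the project meeting",
    "Project meeting moved to Tuesday",
    "Flight booking confirmation for Rome",
    "Hotel booking confirmation in Rome",
    "Your flight to Rome: boarding pass",
]
labels, centroids = cluster([vectorize(m) for m in mailbox], k=2)
new_folder = redirect("Rome hotel booking receipt", centroids)
print(labels, new_folder, describe(mailbox, labels, new_folder))
```

On this toy mailbox the two topics share no terms, so the folders separate cleanly. Real mailboxes would also need the structured fields (sender, recipients, dates) that the abstract mentions, which this sketch ignores.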

Keywords

Email classification, Text mining, Clustering, Pattern discovery

Abbreviations

H.2.8 (Database Management): Database Applications–Data Mining

I.5.3 (Pattern Recognition): Clustering–Algorithms, Similarity measures

I.5.4 (Pattern Recognition): Applications–Text processing

H.4.3 (Information Systems Applications): Communications Applications–Electronic mail

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Giuseppe Manco (1)
  • Elio Masciari (1)
  • Andrea Tagarelli (2)

  1. ICAR-CNR, Rende (CS), Italy
  2. DEIS, University of Calabria, Rende (CS), Italy
