Cluster Based Text Classification Model

  • Sarwat Nizamani
  • Nasrullah Memon
  • Uffe Kock Wiil
Part of the Lecture Notes in Social Networks book series (LNSN)


We propose a cluster-based classification model for suspicious email detection and other text classification tasks. Text classification tasks typically involve many training examples, which ordinarily call for a complex classification model. Using clusters for classification makes the model simpler and, at the same time, increases accuracy. A test example is classified using a simpler and smaller model. The training examples in a particular cluster share a common vocabulary. At the time of clustering, we do not take the labels of the training examples into account. After the clusters have been created, a classifier is trained on each cluster, which has reduced dimensionality and fewer examples. The experimental results show that the proposed model outperforms existing classification models on suspicious email detection and on topic categorization using the Reuters-21578 and 20 Newsgroups datasets. Our model also outperforms the A Decision Cluster Classification (ADCC) and Decision Cluster Forest Classification (DCFC) models on the Reuters-21578 dataset.
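The pipeline described in the abstract (cluster the training examples without looking at labels, train one classifier per cluster, then route a test example to a cluster and use that cluster's classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the clustering algorithm, the per-cluster classifier, or the routing rule, so the sketch assumes a k-means-style clustering with cosine similarity, a multinomial naive Bayes stand-in classifier, and nearest-centroid routing. All function names, class names, and the toy emails are hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vec, centroids):
    """Index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def cluster(vectors, k, iterations=10):
    """k-means-style clustering (assumed); class labels are never consulted."""
    centroids = [Counter(v) for v in vectors[:k]]  # deterministic seed: first k docs
    for _ in range(iterations):
        assignment = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = sum(members, Counter())
    return centroids, [nearest(v, centroids) for v in vectors]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, vectors, labels):
        self.vocab = set().union(*vectors)  # only this cluster's vocabulary
        self.class_docs = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for vec, label in zip(vectors, labels):
            self.term_counts[label].update(vec)
        return self

    def predict(self, vec):
        n_docs = sum(self.class_docs.values())

        def score(label):
            total = sum(self.term_counts[label].values()) + len(self.vocab)
            s = math.log(self.class_docs[label] / n_docs)
            for term, count in vec.items():
                if term in self.vocab:  # terms outside the cluster are ignored
                    s += count * math.log((self.term_counts[label][term] + 1) / total)
            return s

        return max(self.class_docs, key=score)


class ClusterClassifier:
    """One small classifier per cluster; test examples are routed by centroid."""

    def fit(self, texts, labels, k=2):
        vectors = [vectorize(t) for t in texts]
        self.centroids, assignment = cluster(vectors, k)
        self.models = {}
        for c in range(k):
            idx = [i for i, a in enumerate(assignment) if a == c]
            if idx:
                self.models[c] = NaiveBayes().fit(
                    [vectors[i] for i in idx], [labels[i] for i in idx]
                )
        return self

    def predict(self, text):
        vec = vectorize(text)
        c = max(self.models, key=lambda i: cosine(vec, self.centroids[i]))
        return self.models[c].predict(vec)


# Toy data (hypothetical): two vocabularies, each with both labels.
texts = [
    "project meeting report deadline",
    "team match goal score",
    "project meeting bomb attack",
    "team match bomb attack",
]
labels = ["normal", "normal", "suspicious", "suspicious"]

clf = ClusterClassifier().fit(texts, labels, k=2)
print(clf.predict("project bomb attack"))
print(clf.predict("team goal score"))
```

In this toy setup each per-cluster model sees only the vocabulary and examples of its own cluster, which is the sense in which the abstract speaks of reduced dimensionality and fewer examples per classifier.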


Keywords: Support Vector Machine, Classification Model, Text Categorization, Current Cluster, Text Classification Task



Copyright information

© Springer-Verlag/Wien 2011

Authors and Affiliations

  • Sarwat Nizamani (1, 2)
  • Nasrullah Memon (1, 3)
  • Uffe Kock Wiil (1)

  1. Counterterrorism Research Lab, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark
  2. University of Sindh, Jamshoro, Pakistan
  3. Hellenic American University, Manchester, USA
