Abstract
We propose a cluster-based classification model for suspicious email detection and other text classification tasks. Text classification tasks typically involve many training examples, which ordinarily call for a complex classification model. Using clusters for classification makes the model simpler and, at the same time, increases accuracy: each test example is classified by a simpler, smaller model. The training examples in a given cluster share a common vocabulary. Clustering does not take the labels of the training examples into account. After the clusters have been created, a classifier is trained on each cluster, which has reduced dimensionality and fewer examples. Experimental results show that the proposed model outperforms existing classification models on suspicious email detection and on topic categorization with the Reuters-21578 and 20 Newsgroups datasets. Our model also outperforms the A Decision Cluster Classification (ADCC) and Decision Cluster Forest Classification (DCFC) models on the Reuters-21578 dataset.
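The pipeline described in the abstract (cluster without labels, train one classifier per cluster, route each test example to its cluster's classifier) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the base classifier, so the choices here are assumptions made only for illustration: a small k-means with cosine similarity and a multinomial Naive Bayes with add-one smoothing. All names (`ClusterClassifier`, `kmeans`, `NaiveBayes`) and the toy emails are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dicts.
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iters=10):
    # Assumed clustering step: labels are never used here.
    # Deterministic farthest-first initialisation, starting from the first vector.
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        far = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(dict(far))
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = {t: n / len(members) for t, n in total.items()}
    return centroids, assign

class NaiveBayes:
    # Assumed base classifier: multinomial Naive Bayes with add-one smoothing.
    def fit(self, docs, labels):
        self.vocab = {t for d in docs for t in d}  # cluster-local vocabulary
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for d, y in zip(docs, labels):
            self.term_counts[y].update(d)
        return self

    def predict(self, doc):
        n = sum(self.class_counts.values())
        def score(y):
            total = sum(self.term_counts[y].values()) + len(self.vocab)
            s = math.log(self.class_counts[y] / n)
            for t, c in doc.items():
                if t in self.vocab:  # terms outside the cluster vocabulary are ignored
                    s += c * math.log((self.term_counts[y][t] + 1) / total)
            return s
        return max(self.class_counts, key=score)

class ClusterClassifier:
    def __init__(self, k):
        self.k = k

    def fit(self, texts, labels):
        vectors = [Counter(tokenize(t)) for t in texts]
        self.centroids, assign = kmeans(vectors, self.k)
        self.models = {}
        for c in set(assign):
            idx = [i for i, a in enumerate(assign) if a == c]
            # Each classifier sees only its cluster's examples and vocabulary.
            self.models[c] = NaiveBayes().fit([vectors[i] for i in idx],
                                              [labels[i] for i in idx])
        return self

    def predict(self, text):
        v = Counter(tokenize(text))
        # Route the test example to the most similar cluster, then classify there.
        c = max(self.models, key=lambda c: cosine(v, self.centroids[c]))
        return self.models[c].predict(v)

# Toy, made-up emails: two vocabularies (travel, banking), two labels.
texts = [
    "meeting flight airport ticket attack",
    "flight airport ticket holiday hotel",
    "airport flight bomb ticket",
    "hotel holiday flight ticket beach",
    "bank transfer money account weapons",
    "bank account money salary invoice",
    "money transfer account explosives",
    "invoice salary bank account money",
]
labels = ["suspicious", "normal"] * 4
model = ClusterClassifier(k=2).fit(texts, labels)
print(model.predict("flight ticket bomb airport"))
print(model.predict("bank account salary invoice"))
```

On this toy data each per-cluster classifier works with fewer examples and a smaller vocabulary than a single global model would, which is the simplification the abstract refers to. The sketch makes no claim about the accuracy gains reported in the chapter.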
Notes
1. The terms categorization and classification are used interchangeably throughout the paper.
Copyright information
© 2011 Springer-Verlag/Wien
About this chapter
Cite this chapter
Nizamani, S., Memon, N., Wiil, U.K. (2011). Cluster Based Text Classification Model. In: Wiil, U.K. (eds) Counterterrorism and Open Source Intelligence. Lecture Notes in Social Networks. Springer, Vienna. https://doi.org/10.1007/978-3-7091-0388-3_14
DOI: https://doi.org/10.1007/978-3-7091-0388-3_14
Publisher Name: Springer, Vienna
Print ISBN: 978-3-7091-0387-6
Online ISBN: 978-3-7091-0388-3
eBook Packages: Computer Science (R0)