Abstract
This work addresses a problem associated with imbalanced text corpora and presents a method for converting an imbalanced text corpus into a balanced one using a clustering algorithm. First, to avoid the curse of dimensionality, a representation scheme based on a term-class relevancy measure is adopted, which reduces the dimension to the number of classes in the corpus. Next, the samples of the larger classes are grouped into a number of smaller subclasses so that the entire corpus becomes balanced. Each subclass is then given a single symbolic vector representation using interval-valued features. Being compact, this symbolic representation reduces both the space requirement and the classification time. The proposed model is evaluated empirically on the benchmark datasets Reuters-21578 and TDT2, and it is compared against several existing contemporary models, including a model based on the support vector machine. The comparative analysis indicates that the proposed model outperforms the other existing models.
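The balancing and symbolic-representation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the clustering algorithm nor the way the intervals are built, so the sketch assumes a plain k-means with deterministic farthest-point initialisation and per-feature (min, max) intervals. The function names, the `target_size` parameter, and the toy feature vectors are all hypothetical; documents are assumed to be already reduced to one feature per class.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, iters=20):
    """Assumed clustering step: Lloyd's k-means, one cluster label per point.

    Centres start from the first point and are extended with the point
    farthest from the centres chosen so far (deterministic initialisation).
    """
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Move each centre to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def interval_vector(members):
    """Assumed symbolic form: one (min, max) interval per feature."""
    return [(min(col), max(col)) for col in zip(*members)]


def balance_and_summarize(corpus, target_size):
    """Split larger classes into subclasses and summarise each subclass.

    corpus: dict mapping class label -> list of feature vectors.
    target_size: hypothetical desired subclass size, for example the size
    of the smallest class. k-means does not guarantee equal-sized clusters,
    so the result is only approximately balanced.
    Returns a list of (class_label, interval_vector) pairs.
    """
    summaries = []
    for label, docs in corpus.items():
        k = min(len(docs), max(1, round(len(docs) / target_size)))
        labels = kmeans_labels(docs, k) if k > 1 else [0] * len(docs)
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                summaries.append((label, interval_vector(members)))
    return summaries


# Toy corpus: a class of six documents and a class of three documents.
corpus = {
    "big": [[0.9, 0.1], [0.8, 0.1], [0.9, 0.2],
            [0.6, 0.4], [0.5, 0.4], [0.6, 0.5]],
    "small": [[0.1, 0.9], [0.2, 0.8], [0.1, 0.8]],
}
summaries = balance_and_summarize(corpus, target_size=3)
```

In this toy run the six-document class is split into two subclasses, so the corpus ends up with three groups of three documents, each stored as a single interval vector rather than as its individual members.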
Acknowledgements
The second author of this paper acknowledges the financial support rendered by the University of Mysore under UPE grants for the High Performance Computing laboratory. The first and fourth authors of this paper acknowledge the financial support rendered by Pillar4 Company, Bangalore.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Raju, L.N., Suhil, M., Guru, D.S., Gowda, H.S. (2017). Cluster Based Symbolic Representation for Skewed Text Categorization. In: Santosh, K., Hangarge, M., Bevilacqua, V., Negi, A. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2016. Communications in Computer and Information Science, vol 709. Springer, Singapore. https://doi.org/10.1007/978-981-10-4859-3_19
DOI: https://doi.org/10.1007/978-981-10-4859-3_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-4858-6
Online ISBN: 978-981-10-4859-3
eBook Packages: Computer Science, Computer Science (R0)