Cluster Based Symbolic Representation for Skewed Text Categorization

Raju, Lavanya Narayana; Suhil, Mahamad; Guru, D. S.; Gowda, Harsha S.

doi:10.1007/978-981-10-4859-3_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 709)

Included in the following conference series:

International Conference on Recent Trends in Image Processing and Pattern Recognition

Abstract

In this work, the problem of class imbalance in text corpora is addressed, and a method of converting an imbalanced text corpus into a balanced one by means of a clustering algorithm is presented. First, to avoid the curse of dimensionality, a representation scheme based on a term-class relevancy measure is adopted, which reduces the dimension of each document vector to the number of classes in the corpus. Next, the samples of the larger classes are grouped into a number of smaller subclasses so that the corpus as a whole becomes balanced. Each subclass is then given a single symbolic vector representation through the use of interval-valued features. This symbolic representation is compact and reduces both the space requirement and the classification time. The proposed model is evaluated empirically on the benchmark datasets Reuters-21578 and TDT2, and it is compared against several existing contemporary models, including a model based on the support vector machine. The comparative analysis indicates that the proposed model outperforms the other models.
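The balancing and symbolic representation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, the form of the intervals, or the matching rule, so the plain k-means routine, the [min, max] intervals, the containment-count classifier, and all function names and the synthetic data below are assumptions made for illustration. The sketch also assumes the documents have already been reduced to vectors whose length equals the number of classes.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns one label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every sample to every centre, then nearest centre.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def build_symbolic_model(X, y, target_size):
    """Split classes larger than target_size into subclasses and summarise
    each subclass by one interval-valued vector (assumed here: [min, max])."""
    model = []  # entries: (class_label, lower_bounds, upper_bounds)
    for c in np.unique(y):
        Xc = X[y == c]
        k = max(1, int(np.ceil(len(Xc) / target_size)))
        labels = kmeans(Xc, k) if k > 1 else np.zeros(len(Xc), dtype=int)
        for j in np.unique(labels):
            members = Xc[labels == j]
            model.append((c, members.min(axis=0), members.max(axis=0)))
    return model

def classify(model, x):
    """Hypothetical matching rule: prefer the subclass whose intervals contain
    the most features of x; break ties by distance to the interval midpoint."""
    best_key, best_class = None, None
    for c, lo, hi in model:
        inside = int(np.sum((x >= lo) & (x <= hi)))
        dist = float(np.linalg.norm(x - (lo + hi) / 2))
        key = (-inside, dist)
        if best_key is None or key < best_key:
            best_key, best_class = key, c
    return best_class

# Synthetic two-class corpus: 40 samples of class 0, 10 samples of class 1.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([0.9, 0.1], 0.02, size=(20, 2)),
    rng.normal([0.6, 0.1], 0.02, size=(20, 2)),
    rng.normal([0.1, 0.9], 0.02, size=(10, 2)),
])
y = np.array([0] * 40 + [1] * 10)

model = build_symbolic_model(X, y, target_size=20)
print(len(model), "symbolic vectors stand in for", len(X), "documents")
print(classify(model, np.array([0.1, 0.9])))
```

In this sketch the classifier compares a test vector against one interval vector per subclass rather than against every training document, which is the source of the space and time savings the abstract mentions.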

Acknowledgements

The second author of this paper acknowledges the financial support rendered by the University of Mysore under UPE grants for the High Performance Computing laboratory. The first and fourth authors of this paper acknowledge the financial support rendered by Pillar4 Company, Bangalore.

Author information

Authors and Affiliations

Department of Studies in Computer Science, University of Mysore, Mysore, India
Lavanya Narayana Raju, Mahamad Suhil, D. S. Guru & Harsha S. Gowda

Corresponding author

Correspondence to Mahamad Suhil.

Editor information

Editors and Affiliations

The University of South Dakota, Vermillion, South Dakota, USA
K.C. Santosh
Karnatak Arts, Science and Commerce College, Bidar, India
Mallikarjun Hangarge
Politecnico di Bari, Bari, Italy
Vitoantonio Bevilacqua
University of Hyderabad, Hyderabad, India
Atul Negi

About this paper

Cite this paper

Raju, L.N., Suhil, M., Guru, D.S., Gowda, H.S. (2017). Cluster Based Symbolic Representation for Skewed Text Categorization. In: Santosh, K., Hangarge, M., Bevilacqua, V., Negi, A. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2016. Communications in Computer and Information Science, vol 709. Springer, Singapore. https://doi.org/10.1007/978-981-10-4859-3_19

DOI: https://doi.org/10.1007/978-981-10-4859-3_19
Published: 29 April 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-4858-6
Online ISBN: 978-981-10-4859-3
eBook Packages: Computer Science, Computer Science (R0)
