Cluster Based Symbolic Representation for Skewed Text Categorization

  • Conference paper
Recent Trends in Image Processing and Pattern Recognition (RTIP2R 2016)

Abstract

This work addresses the problem of class imbalance in text corpora and presents a method for converting an imbalanced text corpus into a balanced one by means of a clustering algorithm. First, to avoid the curse of dimensionality, a representation scheme based on a term-class relevancy measure is adopted, which reduces the dimension to the number of classes in the corpus. Next, the samples of the larger classes are grouped into a number of smaller subclasses so that the entire corpus becomes balanced. Each subclass is then given a single symbolic vector representation using interval-valued features. Being compact, this symbolic representation reduces both the space requirement and the classification time. The proposed model is evaluated empirically on the benchmark datasets Reuters-21578 and TDT2, and is compared against several existing contemporary models, including a model based on the support vector machine. The comparative analysis indicates that the proposed model outperforms the other existing models.
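The balancing and symbolic-representation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, the interval definition, or the matching rule, so the sketch assumes plain k-means, [min, max] intervals per feature, and a simple count of features that fall inside an interval. The names `kmeans`, `interval_vector`, `balance_and_summarize`, `classify`, and the parameter `target_size` are all hypothetical.

```python
import math
from collections import defaultdict


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (assumed; the abstract names no algorithm).
    Deterministic start: k evenly spaced points serve as initial centers."""
    step = len(points) // k
    centers = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def interval_vector(samples):
    """One (min, max) interval per feature (assumed interval definition)."""
    return [(min(col), max(col)) for col in zip(*samples)]


def balance_and_summarize(data, target_size):
    """data maps a class label to a list of feature vectors.
    Classes larger than target_size are split into subclasses; every
    (sub)class is reduced to a single interval-valued vector."""
    symbols = []
    for label, samples in data.items():
        k = max(1, math.ceil(len(samples) / target_size))
        assignment = kmeans(samples, k) if k > 1 else [0] * len(samples)
        groups = defaultdict(list)
        for sample, group in zip(samples, assignment):
            groups[group].append(sample)
        for group in sorted(groups):
            symbols.append((label, interval_vector(groups[group])))
    return symbols


def classify(x, symbols):
    """Hypothetical matching rule: pick the label of the symbolic vector
    whose intervals contain the largest number of features of x."""
    def score(intervals):
        return sum(lo <= v <= hi for v, (lo, hi) in zip(x, intervals))
    return max(symbols, key=lambda s: score(s[1]))[0]


# Toy corpus: one class with four samples, one with two (2-D features).
data = {
    "big": [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]],
    "small": [[0.5, 0.5], [0.6, 0.4]],
}
symbols = balance_and_summarize(data, target_size=2)
```

In this toy corpus, the four-sample class is split into two subclasses of two samples each, so the six training vectors are replaced by three interval-valued vectors. A test vector is then compared against those three instead of against every sample, which is the kind of space and time saving the abstract refers to.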

Acknowledgements

The second author of this paper acknowledges the financial support rendered by the University of Mysore under UPE grants for the High Performance Computing laboratory. The first and fourth authors of this paper acknowledge the financial support rendered by Pillar4 Company, Bangalore.

Author information

Corresponding author

Correspondence to Mahamad Suhil.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Raju, L.N., Suhil, M., Guru, D.S., Gowda, H.S. (2017). Cluster Based Symbolic Representation for Skewed Text Categorization. In: Santosh, K., Hangarge, M., Bevilacqua, V., Negi, A. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2016. Communications in Computer and Information Science, vol 709. Springer, Singapore. https://doi.org/10.1007/978-981-10-4859-3_19

  • DOI: https://doi.org/10.1007/978-981-10-4859-3_19

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-4858-6

  • Online ISBN: 978-981-10-4859-3

  • eBook Packages: Computer Science, Computer Science (R0)