Categorization of Bangla Web Text Documents Based on TF-IDF-ICF Text Analysis Scheme

  • Ankita Dhar
  • Niladri Sekhar Dash
  • Kaushik Roy
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 836)


With the rapid growth and wide availability of digital text data, automatic text categorization (or classification) has become an effective way to organize and manage textual information. It is the process of automatically assigning a text document to one of a predefined set of categories. Although many methods have been applied to English text documents, only limited studies have been carried out on Indian-language texts, including Bangla. Against this background, this paper analyzes the efficiency of some existing text classification methods and proposes to supplement them with a new analysis method for classifying Bangla text documents obtained from online web sources. The paper argues that adding the Inverse Class Frequency (ICF) measure to the Term Frequency (TF) and Inverse Document Frequency (IDF) measures can yield better results in feature extraction for a language like Bangla. The combination of all three measures generates a set of features that is then used to train a Multilayer Perceptron (MLP) classifier, which produces promising results in assigning text documents to their respective domains and categories. Comparison of this classifier with others confirms that it achieves higher accuracy on Bangla text documents. The MLP is also expected to perform satisfactorily on high-dimensional and relatively noisy feature vectors.
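The abstract does not state the exact weighting formula, so the following is only a minimal sketch of how a TF-IDF-ICF weight might be computed. It assumes the form tf × log(1 + N/df) × log(1 + C/cf), where N is the number of documents, df the number of documents containing the term, C the number of classes, and cf the number of classes in which the term occurs. The function name `tf_idf_icf`, the smoothing, and the toy English tokens are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   : list of token lists (documents are assumed to be tokenized)
    labels : list of class labels, one per document
    Assumed weighting (not taken from the paper):
        tf * log(1 + N/df) * log(1 + C/cf)
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    df = Counter()        # df[t]: number of documents containing term t
    classes_with = {}     # classes_with[t]: classes in which t occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            classes_with.setdefault(term, set()).add(label)

    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for term, count in counts.items():
            tf = count / total                      # relative term frequency
            idf = math.log(1 + n_docs / df[term])   # rarer across documents -> larger
            icf = math.log(1 + n_classes / len(classes_with[term]))  # rarer across classes -> larger
            vec[term] = tf * idf * icf
        weighted.append(vec)
    return weighted


# Hypothetical toy corpus: three tokenized documents in two classes.
docs = [
    ["cricket", "match", "team"],
    ["team", "election", "vote"],
    ["vote", "election", "minister"],
]
labels = ["sports", "politics", "politics"]

weights = tf_idf_icf(docs, labels)
# "cricket" occurs in one document and one class, "team" in two of each,
# so "cricket" receives the larger weight in the first document.
print(weights[0]["cricket"] > weights[0]["team"])
```

Under this assumed form, the ICF factor shrinks the weight of terms that are spread across many classes, which is the intuition behind adding it to TF-IDF. The resulting per-document weights would then be turned into fixed-length vectors and passed to a classifier such as the MLP mentioned above.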


Bangla text classification; Term Frequency; Inverse Document Frequency; Inverse Class Frequency; MLP; Corpus



One of the authors thanks DST for support in the form of an INSPIRE fellowship.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Computer Science, West Bengal State University, Kolkata, India
  2. Linguistic Research Unit, Indian Statistical Institute, Baranagar, India
