Efficient Feature Selection Based on Modified Cuckoo Search Optimization Problem for Classifying Web Text Documents

  • Ankita Dhar
  • Niladri Sekhar Dash
  • Kaushik Roy
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1037)

Abstract

The volume of information on the web keeps growing, and its varying dimensionality, together with redundant and irrelevant terms, makes it difficult for users to filter and analyse efficiently. Managing, filtering and organizing such large datasets requires classifying the text documents. Text classification is the process of assigning text documents to predefined categories based on their content. This paper explores the Cuckoo search optimization (CSO) algorithm, which is modelled on the behaviour of cuckoo birds, and modifies it to select relevant features. The revised algorithm, named the modified Cuckoo search (MCS) optimization algorithm, can prove useful for developing an efficient text classification system. The proposed method combines MCS with the Naive Bayes Multinomial (NBM) algorithm to generate a suitable feature set, which increases the success rate. The approach is tested on 9000 text documents covering eight domains, collected from several web sources, and obtains encouraging results. A comparison with other well-known text classification approaches shows the effectiveness of the proposed approach as an automatic Bangla text classification system.
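
The abstract does not say how MCS alters the original algorithm, so the following is only a minimal sketch of the general idea it describes: a cuckoo search over feature subsets whose fitness comes from a multinomial Naive Bayes classifier. It uses the standard Levy-flight cuckoo search in simplified form, not the authors' modification. The toy English corpus, the per-feature penalty of 0.01, the step scale of 0.1 and all function names are assumptions made for illustration.

```python
import math
import random

def train_nbm(docs, labels, mask):
    """Fit multinomial Naive Bayes (add-one smoothing) on the features kept by mask."""
    feats = [j for j, keep in enumerate(mask) if keep]
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        rows = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = math.log(len(rows) / len(docs))
        counts = [sum(r[j] for r in rows) + 1 for j in feats]
        total = sum(counts)
        cond[c] = [math.log(k / total) for k in counts]
    return feats, prior, cond

def predict_nbm(model, doc):
    """Return the class with the highest log-posterior for one count vector."""
    feats, prior, cond = model
    return max(prior, key=lambda c: prior[c] + sum(doc[j] * w for j, w in zip(feats, cond[c])))

def fitness(mask, train, test):
    """Held-out accuracy minus an assumed penalty of 0.01 per selected feature."""
    if not any(mask):
        return 0.0
    model = train_nbm(train[0], train[1], mask)
    hits = sum(predict_nbm(model, d) == y for d, y in zip(*test))
    return hits / len(test[1]) - 0.01 * sum(mask)

def levy_step(rng, beta=1.5):
    """Draw a Levy-distributed step length with Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    v = rng.gauss(0, 1) or 1e-12  # guard against an exact zero draw
    return rng.gauss(0, sigma) / abs(v) ** (1 / beta)

def to_mask(position):
    """A feature is selected when its continuous coordinate is positive."""
    return [p > 0 for p in position]

def cuckoo_select(train, test, n_feats, n_nests=8, iters=30, pa=0.25, seed=0):
    rng = random.Random(seed)
    nests = [[rng.uniform(-1, 1) for _ in range(n_feats)] for _ in range(n_nests)]
    scores = [fitness(to_mask(n), train, test) for n in nests]
    for _ in range(iters):
        best = nests[scores.index(max(scores))]
        for i in range(n_nests):
            # Levy flight scaled by the distance to the current best nest
            egg = [x + 0.1 * levy_step(rng) * (x - b) for x, b in zip(nests[i], best)]
            s = fitness(to_mask(egg), train, test)
            j = rng.randrange(n_nests)  # compare against a randomly chosen nest
            if s > scores[j]:
                nests[j], scores[j] = egg, s
        # Abandon the worst fraction pa of nests and rebuild them at random
        order = sorted(range(n_nests), key=scores.__getitem__)
        for i in order[:int(pa * n_nests)]:
            nests[i] = [rng.uniform(-1, 1) for _ in range(n_feats)]
            scores[i] = fitness(to_mask(nests[i]), train, test)
    k = scores.index(max(scores))
    return to_mask(nests[k]), scores[k]

# Hypothetical toy corpus: columns are counts of goal, match, vote, party, the, and
train = ([[3, 2, 0, 0, 4, 3], [2, 3, 0, 0, 5, 2], [0, 0, 3, 2, 4, 3], [0, 1, 2, 3, 5, 2]],
         ["sports", "sports", "politics", "politics"])
test = ([[4, 1, 0, 1, 3, 4], [0, 0, 4, 2, 3, 4]], ["sports", "politics"])
mask, score = cuckoo_select(train, test, n_feats=6)
print(mask, round(score, 3))
```

In this sketch each nest is a real-valued vector and a feature counts as selected when its coordinate is positive, which is one common way to adapt a continuous search to subset selection. The penalty term simply favours smaller subsets when accuracy is tied; a real experiment would use cross-validation on a proper corpus rather than a two-document hold-out.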

Keywords

Text classification; Meta-heuristic; Modified Cuckoo search; Feature selection; Naive Bayes Multinomial

Notes

Acknowledgement

One of the authors thanks DST for the INSPIRE fellowship and also acknowledges the links provided in [7], from which the data were collected.

References

  1. Al-Radaideh, Q.A., Al-Khateeb, S.S.: An associative rule-based classifier for Arabic medical text. Int. J. Knowl. Eng. Data Min. 03, 255–273 (2015)
  2. Aly, W., Kelleny, H.A.: Adaptation of Cuckoo search for documents clustering. Int. J. Comput. Appl. Technol. 86, 4–10 (2014)
  3. ArunaDevi, K., Saveeth, R.: A novel approach on Tamil text classification using C-Feature. Int. J. Sci. Res. Dev. 2, 343–345 (2014)
  4. Bolaj, P., Govilkar, S.: Text classification for Marathi documents using supervised learning methods. Int. J. Comput. Appl. 155, 6–10 (2016)
  5. Bouguelia, M.R., Nowaczyk, S., Santosh, K.C., Verikas, A.: Agreeing to disagree: active learning with noisy labels without crowdsourcing. Int. J. Mach. Learn. Cybern. 9, 1307–1319 (2018)
  6. DeySarkar, S., Goswami, S., Agarwal, A., Akhtar, J.: A novel feature selection technique for text classification using Naive Bayes. Int. Sch. Res. Not. 2014, 10 (2014)
  7. Dhar, A., Dash, N.S., Roy, K.: Categorization of Bangla web text documents based on TF-IDF-ICF text analysis scheme. In: Mandal, J.K., Sinha, D. (eds.) CSI 2018. CCIS, vol. 836, pp. 477–484. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-1343-1_39
  8. Gupta, N., Gupta, V.: Punjabi text classification using Naive Bayes, centroid and hybrid approach. In: Proceedings of the 3rd Workshop on South and South East Asian Natural Language Processing, pp. 109–122 (2012)
  9. Guru, D.S., Suhil, M.: A novel term_class relevance measure for text categorization. In: Proceedings of International Conference on Advanced Computing Technologies and Applications, pp. 13–22 (2015)
  10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11, 10–18 (2009)
  11. Islam, Md.S., Jubayer, F.E.Md., Ahmed, S.I.: A support vector machine mixed with TF-IDF algorithm to categorize Bengali document. In: Proceedings of International Conference on Electrical, Computer and Communication Engineering, pp. 191–196 (2017)
  12. Jin, P., Zhang, Y., Chen, X., Xia, Y.: Bag-of-embeddings for text classification. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 2824–2830 (2016)
  13. Kabir, F., Siddique, S., Kotwal, M.R.A., Huda, M.N.: Bangla text document categorization using stochastic gradient descent (SGD) classifier. In: Proceedings of International Conference on Cognitive Computing and Information Processing, pp. 1–4 (2015)
  14. Kim, S., Han, K., Rim, H., Myaeng, S.: Some effective techniques for Naive Bayes text classification. IEEE Trans. Knowl. Data Eng. 18, 1457–1466 (2006)
  15. Mandal, A.K., Sen, R.: Supervised learning methods for Bangla web document categorization. Int. J. Artif. Intell. Appl. 05, 93–105 (2014)
  16. Mansur, M., UzZaman, N., Khan, M.: Analysis of N-gram based text categorization for Bangla in a Newspaper Corpus. In: Proceedings of International Conference on Computer and Information Technology, p. 08 (2006)
  17. Rautray, R., Balabantaray, R.C.: CSTS: cuckoo search based model for text summarization. In: Dash, S.S., Vijayakumar, K., Panigrahi, B.K., Das, S. (eds.) Artificial Intelligence and Evolutionary Computations in Engineering Systems. AISC, vol. 517, pp. 141–150. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-3174-8_13
  18. Redmond, M., Salesi, S., Cosma, G.: A novel approach based on an extended cuckoo search algorithm for the classification of tweets which contain Emoticon and Emoji. In: Proceedings of IEEE International Conference on Knowledge Engineering and Applications, pp. 13–19 (2017)
  19. Sujana, T.S., Rao, N.M.S., Reddy, R.S.: An efficient feature selection using parallel cuckoo search and Naive Bayes classifier. In: Proceedings of IEEE International Conference on Networks & Advances in Computational Technologies, pp. 167–172 (2017)
  20. Vajda, S., Santosh, K.C.: A fast k-nearest neighbor classifier using unsupervised clustering. In: Santosh, K.C., Hangarge, M., Bevilacqua, V., Negi, A. (eds.) RTIP2R 2016. CCIS, vol. 709, pp. 185–193. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-4859-3_17
  21. Wang, D., Zhang, H., Liu, R., Lv, W.: Feature selection based on term frequency and T-Test for text categorization. In: Proceedings of the ACM International Conference on Information and Knowledge Management, pp. 1482–1486 (2012)
  22. Wilbur, W.J., Kim, W.: The ineffectiveness of within-document term frequency in text classification. Inf. Retrieval 12, 509–525 (2009)
  23. Yang, X.S., Deb, S.: Cuckoo search via Levy flights. World Congress on Nature & Biologically Inspired Computing, pp. 210–214 (2009)

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Ankita Dhar (1)
  • Niladri Sekhar Dash (2)
  • Kaushik Roy (1)
  1. Department of Computer Science, West Bengal State University, Kolkata, India
  2. Linguistic Research Unit, Indian Statistical Institute, Kolkata, India
