Amelioration of linguistic semantic classifier with sentiment classifier manacle for the focused web crawler

  • Original Research
International Journal of Information Technology

Abstract

Sentiment-relevant information on web pages about products, establishments, and commodities resides principally in their textual content. Despite the steep rise in sentiment-relevant information on the web, research on crawling sentiment-relevant web pages lags far behind research on crawling topic-relevant web pages. This paper addresses that gap by proposing a novel focused web crawler, the Linguistic Semantic Sentiment (LSS) crawler, which collects not only topic-relevant web pages but also sentiment-relevant web pages. The relevance computation module of the LSS crawler uses two classifiers: a linguistic semantic classifier, which computes the semantic relevance of a web page to the topic, and a sentiment classifier, which computes the sentiment relevance of the web page. The performance of the LSS crawler is evaluated using three metrics: harvest rate, target recall, and F1-score. The LSS crawler outperformed the existing focused crawlers, with an average harvest rate of 0.35, target recall of 0.55, and F1-score of 0.42. The evaluation results show that both the linguistic semantic classifier and the sentiment classifier improved the performance of the proposed LSS crawler.
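
The three evaluation metrics named above can be illustrated with a minimal sketch. It assumes the conventional definitions used in the focused-crawling literature: harvest rate as the fraction of crawled pages judged relevant, target recall as the fraction of a held-out target set that the crawler reached, and F1-score as the harmonic mean of the two. The abstract does not state the exact formulas used in the paper, and the function names and sample URLs below are hypothetical.

```python
from typing import Iterable, Set


def harvest_rate(crawled: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of crawled pages that are relevant (assumed definition)."""
    pages = list(crawled)
    if not pages:
        return 0.0
    hits = sum(1 for url in pages if url in relevant)
    return hits / len(pages)


def target_recall(crawled: Iterable[str], targets: Set[str]) -> float:
    """Fraction of the known target pages that the crawl reached (assumed definition)."""
    if not targets:
        return 0.0
    return len(set(crawled) & targets) / len(targets)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of a precision-like and a recall-like score."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical crawl: four pages fetched, two of them relevant,
# and two of the four held-out target pages reached.
crawled_pages = ["u1", "u2", "u3", "u4"]
relevant_pages = {"u1", "u3"}
target_pages = {"u1", "u3", "u5", "u6"}

hr = harvest_rate(crawled_pages, relevant_pages)
tr = target_recall(crawled_pages, target_pages)
f1 = f1_score(hr, tr)
```

Under these assumed definitions, the harmonic mean of the reported averages (0.35 and 0.55) is about 0.43. A small difference from the reported average F1-score of 0.42 would be expected if F1 was computed per run and then averaged, although the abstract does not say how the averages were formed.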

Funding

N/A.

Author information

Corresponding author

Correspondence to K. S. Sakunthala Prabha.

Ethics declarations

Conflict of interest

We declare that there is no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

N/A.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Prabha, K.S.S., Mahesh, C., Goundar, S. et al. Amelioration of linguistic semantic classifier with sentiment classifier manacle for the focused web crawler. Int. j. inf. tecnol. 15, 1137–1149 (2023). https://doi.org/10.1007/s41870-022-01139-w

  • DOI: https://doi.org/10.1007/s41870-022-01139-w
