Web Page Classification on News Feeds Using Hybrid Technique for Extraction

  • Conference paper
Information and Communication Technology for Intelligent Systems

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 107)

Abstract

In the early days of the web, only a handful of websites were hosted, and users could easily keep a log file of information such as web page URLs or domain names. As the number of hosted websites grew, maintaining such log files became impractical. This created the need for tools that help users find information on websites easily, now known as “search engines”. Their main limitation is that users must supply suitable keywords to retrieve relevant information, and in many cases users still obtain irrelevant results. Meanwhile, the number of websites, each holding many web pages, has grown drastically. These web pages are published in a structured or semi-structured manner and comprise various multimedia contents [1–3], so the chance of fetching wrong information has also increased. Hence, there is a need to automatically categorize web pages into predetermined sections. The key aim of this research is to recognize news feeds and allocate them to fixed news sections such as business and sports, so that readers can reach relevant news by browsing the category of their choice. This is done by adopting a hybrid technique of URL analysis and content context analysis.
The paper emphasizes a proposed model for classifying news feeds related to various fields such as sports and health. The model starts with web crawling of URLs and scraping of news contents, followed by analysis that generates keywords and calculates their weights, and finally identifies the relevant category on the basis of the contents fetched from news web portals, including Indian ones, such as “The Times of India”, “Hindustan Times” and “The Guardian” [1, 4, 5].
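The hybrid idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the category lexicons, the stopword list, the `URL_WEIGHT` bonus and the function names `url_hint`, `keyword_weights` and `classify` are all assumptions made for illustration, and relative term frequency stands in for the paper's unspecified weight calculation. Crawling and scraping are omitted; the sketch starts from a URL and already-extracted article text.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical category lexicons; the abstract does not give keyword lists.
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "cricket", "score", "tournament", "player"},
    "business": {"market", "shares", "profit", "bank", "investors", "economy"},
    "health": {"hospital", "doctors", "disease", "vaccine", "patients", "diet"},
}
# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "for", "on",
             "is", "was", "by", "with", "as", "at"}
URL_WEIGHT = 2.0  # assumed bonus when the URL path names a category


def url_hint(url):
    """URL analysis: return the category named by a path segment, or None."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    for segment in segments:
        if segment in CATEGORY_KEYWORDS:
            return segment
    return None


def keyword_weights(text):
    """Content analysis: relative term frequency of non-stopword tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return {term: n / total for term, n in counts.items()}


def classify(url, text):
    """Combine the URL hint and keyword weights; return a category or None."""
    weights = keyword_weights(text)
    scores = {
        category: sum(weights.get(k, 0.0) for k in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    hint = url_hint(url)
    if hint is not None:
        scores[hint] += URL_WEIGHT
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, `classify("https://example.com/news/article-42", "The bank said profit rose as investors returned to the market.")` has no category in its URL path, so the decision rests on the content keywords alone. A URL such as `https://example.com/sports/...` would add the assumed bonus to the sports score.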

Abbreviations

WPC: Web Page Classification

References

  1. Nethra, K., et al.: Web content extraction using hybrid approach. ICTACT J. Soft Comput. 4(2), 692–696 (2014). ISSN 2229-6956

  2. Kardan, A.A., et al.: A novel approach for keyword extraction in learning objects using text mining & WordNet. In: Proceedings of 2nd World Conference on Information Technology, pp. 788–792 (2011)

  3. Menaka, S., Radha, N.: Text classification using keyword extraction technique. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3(12) (2013). ISSN 2277-128X

  4. Jena, L., Kamila, N.: Data extraction & web page categorization using text mining. Int. J. Appl. Innov. Eng. Manage. 2(6) (2013). ISSN 2319-4847

  5. Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Int. J. Inf. Process. Manage. 43(6), 1705–1714 (2007)

  6. Mehtaa, P., et al.: Web personalization using web mining concept and research issue. Int. J. Inf. Educ. Technol. 2(5), 510 (2012). ISSN 2010-3689

  7. Patil, G., Patil, A.: Web information extraction classification using vector space model algorithm. Int. J. Emerg. Technol. Adv. Eng. 1(2) (2011). ISSN 2250-2459

  8. Chauhan, A., et al.: Cleaning web pages for relevant text extraction & text categorization. Int. J. Eng. Res. Technol. (IJERT) 2(1) (2013). ISSN 2278-0181

  9. Singh, A.: Web content extraction to facilitate web mining. Int. J. Electr. Comput. Sci. Eng. 1(3) (2012). ISSN 2277-1956

  10. Peng, X., Chaoi, B.: Document classification based on word semantic hierarchies. In: Proceedings of ACM International Conference on Artificial Intelligence & Application, pp. 362–367 (2005)

  11. Antonis, M.-L., Zaiane, O.R.: Text document categorization by term association. In: Proceedings of IEEE International Conference on Data Mining, pp. 19–26 (2002)

Acknowledgements

I extend my gratitude to my research guide, Dr. Yogesh Kumar Sharma, for his wonderful support in carrying out my research, and to my wife and my parents for encouraging me in all dimensions of my life.

Author information

Corresponding author

Correspondence to Ankit Dilip Patel.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Patel, A.D., Sharma, Y.K. (2019). Web Page Classification on News Feeds Using Hybrid Technique for Extraction. In: Satapathy, S., Joshi, A. (eds) Information and Communication Technology for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 107. Springer, Singapore. https://doi.org/10.1007/978-981-13-1747-7_38
