Web Page Classification on News Feeds Using Hybrid Technique for Extraction

Patel, Ankit Dilip; Sharma, Yogesh Kumar

doi:10.1007/978-981-13-1747-7_38

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 107)

Abstract

In the early days of the web, only a handful of websites were hosted, and users could easily maintain a log file of information such as web page URLs or domain names. As the number of hosted websites grew, maintaining such log files became difficult. This created the need for tools that help users find information on the web easily, now known as “search engines”. Their main limitation is that users must supply suitable search keywords to retrieve relevant information, and in many cases users still obtain irrelevant results. Meanwhile, the number of websites has grown drastically, each holding many web pages. These web pages are published in a structured or semi-structured manner and comprise various multimedia contents [(Nethra et al. in J Soft Comput 4:692–696, 2014) 1; (Kardan et al. in A novel approach for Keyword extraction in learning objects using text mining & WordNet. pp. 788–792, 2011) 2; (Menaka and Radha in Int J Adv Res Comput Sci Softw Eng 352:24–28, 2013) 3], so the chance of fetching wrong information has also increased, and there is a need to automatically categorize web pages into predetermined sections. The key aim of this research is to recognize news feeds and allocate them to fixed news sections such as business and sports, which makes relevant news more accessible because readers can browse the category of their choice. This is done by adopting a hybrid technique of URL analysis and content context analysis.
The paper emphasizes a proposed model for classifying news feeds related to various fields such as sports and health. The model starts with web crawling of URLs and scraping of news contents, followed by analysis that generates keywords and calculates weights, and finally identifies the relevant category on the basis of the contents fetched from various Indian news web portals such as “The Times of India”, “Hindustan Times”, and “The Guardian” [(Nethra et al. in J Soft Comput 4:692–696, 2014) 1; (Jena and Kamila in Int J Appl Innov Eng Manage 2, 2013) 4; (Ercan and Cicekli in Int J Inf Process Manage 43:1705–1714, 2007) 5].
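
The hybrid idea in the abstract (URL analysis combined with content analysis via keywords and weights) can be illustrated with a minimal sketch. This is not the authors' model: the category lexicons, the stop-word list, the `URL_WEIGHT` value, the scoring formula, and the example URL are all assumptions made for illustration, and crawling and scraping are left out so that the example runs offline.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical category lexicons; the chapter does not list its keywords.
CATEGORY_KEYWORDS = {
    "sports": {"cricket", "match", "team", "tournament", "score", "player"},
    "business": {"market", "stock", "shares", "profit", "economy", "bank"},
    "health": {"hospital", "disease", "doctor", "vaccine", "patients", "diet"},
}

# Assumed stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for", "is", "by", "after"}

URL_WEIGHT = 2.0  # assumed bonus when a category name appears in the URL path


def tokenize(text):
    """Lowercase the text and return alphabetic tokens minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def url_scores(url):
    """Give URL_WEIGHT to each category whose name is a token of the URL path."""
    path_tokens = set(re.findall(r"[a-z]+", urlparse(url).path.lower()))
    return {cat: (URL_WEIGHT if cat in path_tokens else 0.0) for cat in CATEGORY_KEYWORDS}


def content_scores(text):
    """Score each category by the relative frequency of its keywords in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {
        cat: sum(counts[w] for w in words) / total
        for cat, words in CATEGORY_KEYWORDS.items()
    }


def classify(url, text):
    """Add URL and content scores; return (best category or 'unknown', all scores)."""
    u, c = url_scores(url), content_scores(text)
    combined = {cat: u[cat] + c[cat] for cat in CATEGORY_KEYWORDS}
    best = max(combined, key=combined.get)
    if combined[best] == 0:
        return "unknown", combined
    return best, combined


if __name__ == "__main__":
    label, scores = classify(
        "https://news.example.com/sports/cricket/article-123",  # made-up URL
        "The team won the match after a late score by the star player.",
    )
    print(label, scores)
```

In this sketch the URL signal dominates whenever a category name appears in the path, and the content signal decides otherwise; a real system would need to tune that balance and use a far larger keyword set.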

Abbreviations

WPC: Web Page Classification

References

1. Nethra, K., et al.: Web content extraction using hybrid approach. ICTACT J. Soft Comput. 4(2), 692–696 (2014). ISSN 2229-6956
2. Kardan, A.A., et al.: A novel approach for Keyword extraction in learning objects using text mining & WordNet. In: Proceeding of 2nd World Conference on Information Technology, pp. 788–792 (2011)
3. Menaka, S., Radha, N.: Text classification using keyword extraction technique. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3(12) (2013). ISSN 2277-128X
4. Jena, L., Kamila, N.: Data extraction & web page categorization using text mining. Int. J. Appl. Innov. Eng. Manage. 2(6) (2013). ISSN 2319-4847
5. Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Int. J. Inf. Process. Manage. 43(6), 1705–1714 (2007)
6. Mehtaa, P., et al.: Web personalization using web mining concept and research issue. Int. J. Inf. Educ. Technol. 2(5), 510 (2012). ISSN 2010-3689
7. Patil, G., Patil, A.: Web information extraction classification using vector space model algorithm. Int. J. Emerg. Technol. Adv. Eng. 1(2) (2011). ISSN 2250-2459
8. Chauhan, A., et al.: Cleaning web pages for relevant text extraction & text categorization. Int. J. Eng. Res. Technol. (IJERT) 2(1) (2013). ISSN 2278-0181
9. Singh, A.: Web content extraction to facilitate web mining. Int. J. Electr. Comput. Sci. Eng. 1(3) (2012). ISSN 2277-1956
10. Peng, X., Chaoi, B.: Document classification based on word semantic hierarchies. In: Proceeding of ACM International Conference on Artificial Intelligence & Application, pp. 362–367 (2005)
11. Antonis, M.-L., Zaiane, O.R.: Text document categorization by term association. In: Proceeding of IEEE International Conference on Data Mining, pp. 19–26 (2002)

Acknowledgements

I extend my gratitude to my research guide, Dr. Yogesh Kumar Sharma, for his wonderful support in carrying out my research, and to my wife and my parents for encouraging me in all dimensions of my life.

Author information

Authors and Affiliations

Shri J. J. T. University, Jhunjhunu, Rajasthan, India
Ankit Dilip Patel
Department of Computer Science, Shri J. J. T. University, Jhunjhunu, India
Yogesh Kumar Sharma

Corresponding author

Correspondence to Ankit Dilip Patel.

Editor information

Editors and Affiliations

School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, India
Suresh Chandra Satapathy
Sabar Institute of Technology, Gujarat Technological University, Ahmedabad, Gujarat, India
Amit Joshi

About this paper

Cite this paper

Patel, A.D., Sharma, Y.K. (2019). Web Page Classification on News Feeds Using Hybrid Technique for Extraction. In: Satapathy, S., Joshi, A. (eds) Information and Communication Technology for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 107. Springer, Singapore. https://doi.org/10.1007/978-981-13-1747-7_38

DOI: https://doi.org/10.1007/978-981-13-1747-7_38
Published: 15 December 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1746-0
Online ISBN: 978-981-13-1747-7
eBook Packages: Intelligent Technologies and Robotics (R0)
