Noise Elimination from Web Page Based on Regular Expressions for Web Content Mining
Web content mining is used to discover useful knowledge or information from web pages. Noisy data in web documents significantly affects the performance of web content mining. In this paper, a noise elimination method is proposed based on regular expressions followed by a Site Style Tree (SST). The proposed technique consists of two phases. In the first phase, a filtering method based on regular expressions is applied to web pages to remove noisy HTML tags. The filtered document then undergoes a second phase, where an entropy-based measure is used to remove further noise. The page size is reduced considerably by eliminating a number of lines of code preceded by some predefined noisy HTML tags. The reduced web document is then used to form a Document Object Model (DOM) tree, and the Site Style Tree is subsequently formed by crawling pages from the same URL path as the website. The experiment was conducted on some of the most popular websites, such as www.amazon.com, www.yahoo.com and www.abcnews.com. The experimental results reveal that the filtering method eliminates a significant amount of noise before the SST is introduced, so the overall space and time complexity is reduced compared to other SST-based approaches.
Keywords: Noise, Web Mining, Web Content Mining, Regular Expression, DOM Tree, Site Style Tree (SST), Node Importance, Composite Importance
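The first phase described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the predefined noisy tags, so the tag set `NOISY_TAGS` and the function name `filter_noise` are assumptions made purely for illustration. The sketch removes HTML comments and a few paired elements together with their contents, using only the standard `re` module.

```python
import re

# Assumed set of "noisy" tags; the paper's actual predefined list is not
# given in the abstract, so these are illustrative only.
NOISY_TAGS = ("script", "style", "noscript", "iframe")

# Matches an opening noisy tag, everything up to the matching closing tag,
# and the closing tag itself (non-greedy, across line breaks).
_BLOCK_RE = re.compile(
    r"<(?P<tag>" + "|".join(NOISY_TAGS) + r")\b[^>]*>.*?</(?P=tag)\s*>",
    re.IGNORECASE | re.DOTALL,
)

# Matches HTML comments, which often wrap ad or tracking markup.
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def filter_noise(html: str) -> str:
    """Return html with comments and assumed noisy elements removed."""
    html = _COMMENT_RE.sub("", html)
    html = _BLOCK_RE.sub("", html)
    return html


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head>"
        "<body><!-- ad --><script>var a=1;</script><p>Hello</p></body></html>"
    )
    cleaned = filter_noise(page)
    print(len(page), "->", len(cleaned), "characters")
    print(cleaned)
```

Regular expressions cannot parse arbitrary HTML reliably, so a filter of this kind is best seen as a cheap pre-processing pass that shrinks the page before a DOM tree is built, which matches the role the abstract assigns to the first phase.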