Noise Elimination from Web Page Based on Regular Expressions for Web Content Mining

  • Amit Dutta
  • Sudipta Paria
  • Tanmoy Golui
  • Dipak Kumar Kole
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 27)

Abstract

Web content mining discovers useful knowledge or information from web pages, so noisy data in a web document significantly affects its performance. In this paper, a noise elimination method is proposed based on regular expressions followed by a Site Style Tree (SST). The proposed technique consists of two phases. In the first phase, a filtering method based on regular expressions is applied to web pages to remove noisy HTML tags. The filtered document then passes to the second phase, where an entropy-based measure is used to remove further noise. The page size is reduced considerably by eliminating lines of code preceded by certain predefined noisy HTML tags. The condensed web document is then used to form a Document Object Model (DOM) tree, and the Site Style Tree is subsequently formed by crawling pages from the same URL path of the website. Experiments were conducted on popular websites such as www.amazon.com, www.yahoo.com and www.abcnews.com. The experimental results reveal that the filtering method eliminates a significant amount of noise before the SST is introduced, so the overall space and time complexity is reduced compared to other SST-based approaches.
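The first phase described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag list `NOISY_TAGS`, the function name `strip_noisy_tags`, and the sample page are assumptions made for illustration, since the abstract does not list the predefined noisy tags or the exact expressions used.

```python
import re

# Hypothetical set of tags treated as noise; the paper's actual
# predefined list is not given in the abstract.
NOISY_TAGS = ("script", "style", "noscript", "iframe")


def strip_noisy_tags(html: str, tags=NOISY_TAGS) -> str:
    """Remove HTML comments and whole elements for each noisy tag."""
    # Drop comments first (assumed to be noise in this sketch).
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    for tag in tags:
        # Match the opening tag, its content (non-greedy), and the closing tag.
        pattern = rf"<{tag}\b[^>]*>.*?</{tag}\s*>"
        html = re.sub(pattern, "", html, flags=re.DOTALL | re.IGNORECASE)
    return html


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><script>track()</script>"
    "<p>Main article text</p><!-- ad slot --></body></html>"
)
cleaned = strip_noisy_tags(sample)
print(len(sample), "->", len(cleaned))
print(cleaned)
```

The filtered string is smaller than the original page, so any DOM tree built from it afterwards has fewer nodes to process. Regular expressions cannot handle every malformed or nested HTML case, so a sketch like this only suits coarse pre-filtering of the kind the abstract describes.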

Keywords

Noise; Web Mining; Web Content Mining; Regular Expression; DOM Tree; Site Style Tree (SST); Node Importance; Composite Importance

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Amit Dutta (1)
  • Sudipta Paria (2)
  • Tanmoy Golui (2)
  • Dipak Kumar Kole (2)
  1. Department of IT, St. Thomas’ College of Engineering & Technology, Kolkata, India
  2. Department of CSE, St. Thomas’ College of Engineering & Technology, Kolkata, India
