Webpage Segments Classification with Incremental Knowledge Acquisition

  • Wei Guo
  • Yang Sok Kim
  • Byeong Ho Kang
Part of the Communications in Computer and Information Science book series (CCIS, volume 124)

Abstract

This paper proposes an incremental information extraction method for social network analysis of web publications. To this end, we employ an incremental knowledge acquisition method, MCRDR (Multiple Classification Ripple-Down Rules), to classify web page segments. Our experimental results show that the MCRDR-based web page segment classification system supports easy acquisition and maintenance of information extraction rules.
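
As an illustration of the general MCRDR idea mentioned in the abstract, the following is a minimal sketch in Python, not the authors' implementation. It assumes the usual MCRDR structure: an n-ary tree of rules in which every satisfied path is followed, and the deepest satisfied rule on each path supplies a conclusion, so one segment can receive several labels. The segment features (`link_ratio`, `word_count`, `text`), the labels, and the names `Rule`, `add_refinement`, and `classify` are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Set

# Hypothetical representation: a page segment is a dict of features.
Segment = Dict[str, Any]


@dataclass
class Rule:
    condition: Callable[[Segment], bool]
    conclusion: Optional[str]  # None makes this a "stopping" rule
    children: List["Rule"] = field(default_factory=list)

    def add_refinement(self, condition: Callable[[Segment], bool],
                       conclusion: Optional[str]) -> "Rule":
        """Attach an exception/refinement rule under this rule."""
        child = Rule(condition, conclusion)
        self.children.append(child)
        return child


def classify(root: Rule, segment: Segment) -> Set[str]:
    """Collect the conclusion of the deepest satisfied rule on every path."""
    conclusions: Set[str] = set()

    def visit(rule: Rule) -> None:
        fired = [c for c in rule.children if c.condition(segment)]
        if not fired:
            # No refinement applies, so this rule's conclusion stands.
            if rule.conclusion is not None:
                conclusions.add(rule.conclusion)
            return
        for child in fired:
            visit(child)

    visit(root)
    return conclusions


# Example knowledge base (all rules are illustrative assumptions).
root = Rule(lambda s: True, None)
nav = root.add_refinement(lambda s: s["link_ratio"] > 0.8, "navigation")
# Stopping rule: long link-heavy blocks are not treated as navigation.
nav.add_refinement(lambda s: s["word_count"] > 200, None)
root.add_refinement(lambda s: "author" in s["text"].lower(), "author_info")

menu = {"link_ratio": 0.9, "word_count": 20, "text": "Home | About | Authors"}
long_block = {"link_ratio": 0.9, "word_count": 300, "text": "Related links"}

menu_labels = classify(root, menu)
long_labels = classify(root, long_block)
```

In this sketch, knowledge is acquired incrementally by calling `add_refinement` whenever an expert disagrees with a classification. Existing rules are never edited, which is what keeps maintenance simple in ripple-down rule systems.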

Keywords

Information extraction; social networks; knowledge acquisition

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Wei Guo (1)
  • Yang Sok Kim (2)
  • Byeong Ho Kang (1)
  1. University of Tasmania, Sandy Bay, Australia
  2. University of New South Wales, Sydney, Australia
