Semantic Partitioning of Web Pages

  • Srinivas Vadrevu
  • Fatih Gelgi
  • Hasan Davulcu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3806)


In this paper we describe the semantic partitioner algorithm, which uses the structural and presentation regularities of Web pages to automatically transform them into hierarchical content structures. These content structures enable us to automatically annotate labels in the Web pages with their semantic roles, thus yielding meta-data and instance information for the Web pages. Experimental results with the TAP knowledge base and computer science department Web sites, comprising 16,861 Web pages, indicate that our algorithm is able to gather meta-data accurately from various types of Web pages. The algorithm achieves this performance without any domain-specific engineering.
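The abstract does not spell out the algorithm's steps, so the following is only a toy illustration of the general idea of exploiting structural regularity, not the authors' semantic partitioner. It assumes well-formed HTML, ignores presentation cues such as fonts and styles, and treats sibling nodes as "regular" when their tag structure is identical. All names (`Node`, `TreeBuilder`, `signature`, `partition`, `parse`) are hypothetical.

```python
from html.parser import HTMLParser
from itertools import groupby


class Node:
    """A minimal DOM node: a tag name, child nodes, and any direct text."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-formed HTML (assumption: no void tags)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def signature(node):
    """Structural signature: the tag plus the signatures of its children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def partition(node):
    """Return a nested dict in which each run of structurally identical
    consecutive siblings is collapsed into a single 'repeat' group."""
    groups = []
    for sig, run in groupby(node.children, key=signature):
        run = list(run)
        parts = [partition(c) for c in run]
        if len(run) > 1:
            groups.append({"repeat": sig[0], "items": parts})
        else:
            groups.extend(parts)
    return {"tag": node.tag, "text": node.text, "children": groups}


def parse(html):
    """Parse an HTML string and return the root Node of the tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    page = ("<div><h2>Faculty</h2><ul>"
            "<li><b>A</b></li><li><b>B</b></li><li><b>C</b></li>"
            "</ul></div>")
    print(partition(parse(page)))
```

In this sketch the heading and the list stay separate nodes, while the three list items, which share one structure, are grouped together. A real system would also need to handle malformed markup and visual cues, which this example deliberately leaves out.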


Keywords (added by machine, not by the authors): Regular Expression, Semantic Role, Attribute Label, Kleene Star, Grammar Induction





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Srinivas Vadrevu (1)
  • Fatih Gelgi (1)
  • Hasan Davulcu (1)
  1. Department of Computer Science and Engineering, Arizona State University, Tempe, USA
