Information Extraction from Semi-structured Web Documents

  • Bo-Hyun Yun
  • Chang-Ho Seo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4092)


This paper proposes a web information extraction system that automatically extracts pre-defined information from web documents (i.e., HTML documents) and integrates the extracted information. The system recognizes entities without labels using a probability-based entity recognition method and semi-automatically extends the existing domain knowledge with the extracted data. In addition, the system extracts sub-linked information, that is, information on pages linked from the basic page, and integrates similar results extracted from heterogeneous sources. Experimental results show a global precision of 93.5% across seven domain sites. Using the sub-linked information together with probability-based entity recognition improves precision significantly over a system that uses only the domain knowledge. Moreover, because the system can be applied flexibly according to the domain, it can extract a wider variety of information precisely. The system can thus increase user satisfaction and contribute to the revitalization of e-business.


Keywords: Information Source; Domain Knowledge; Information Extraction; Learning Data; Token Probability
These keywords were added by machine and not by the authors.




Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Bo-Hyun Yun (1)
  • Chang-Ho Seo (2)
  1. Dept. of Computer Education, Mokwon University, Taejon, Korea
  2. Dept. of Applied Mathematics, Kongju University, Kongju-City, Korea
