Using Grammatical Inference to Automate Information Extraction from the Web

  • Theodore W. Hong
  • Keith L. Clark
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2168)


The World-Wide Web contains a wealth of semistructured information sources that often give partial/overlapping views on the same domains, such as real estate listings or book prices. These partial sources could be used more effectively if integrated into a single view; however, since they are typically formatted in diverse ways for human viewing, extracting their data for integration is a difficult challenge. Existing learning systems for this task generally use hardcoded ad hoc heuristics, are restricted in the domains and structures they can recognize, and/or require manual training. We describe a principled method for automatically generating extraction wrappers using grammatical inference that can recognize general structures and does not rely on manually-labelled examples. Domain-specific knowledge is explicitly separated out in the form of declarative rules. The method is demonstrated in a test setting by extracting real estate listings from web pages and integrating them into an interactive data visualization tool based on dynamic queries.
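To make the idea of inferring a wrapper from unlabelled pages concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes a page whose records share an identical tag structure, abstracts the HTML into a symbol sequence, and infers the shortest repeating unit, which amounts to a toy regular grammar of the form (unit)+. The names `Tokenizer`, `infer_repeating_unit`, `extract_records`, and the sample `LISTINGS` data are all hypothetical.

```python
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flatten HTML into (symbol, text) pairs; text is None for tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append((f"<{tag}>", None))

    def handle_endtag(self, tag):
        self.tokens.append((f"</{tag}>", None))

    def handle_data(self, data):
        # Every non-blank text run is abstracted to the single symbol TEXT.
        if data.strip():
            self.tokens.append(("TEXT", data.strip()))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def infer_repeating_unit(symbols):
    """Return the shortest unit that, repeated at least twice, equals symbols.

    Returns None when the sequence has no such exact repetition.
    """
    n = len(symbols)
    for size in range(1, n // 2 + 1):
        if n % size == 0 and symbols == symbols[:size] * (n // size):
            return symbols[:size]
    return None


def extract_records(html):
    """Split the page into records using the inferred unit; keep text fields."""
    tokens = tokenize(html)
    unit = infer_repeating_unit([sym for sym, _ in tokens])
    if unit is None:
        return []
    size = len(unit)
    return [
        [text for sym, text in tokens[i:i + size] if sym == "TEXT"]
        for i in range(0, len(tokens), size)
    ]


# Hypothetical sample page: three listings with identical markup.
LISTINGS = (
    "<li><b>12 Oak St</b><i>$250,000</i></li>"
    "<li><b>7 Elm Rd</b><i>$310,000</i></li>"
    "<li><b>3 Pine Ave</b><i>$199,500</i></li>"
)

if __name__ == "__main__":
    for record in extract_records(LISTINGS):
        print(record)
```

A practical system would have to tolerate optional fields, nesting, and noise around the records, which exact repetition cannot handle; the sketch only shows how structure can be recovered without labelled examples.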


Real Estate; Inference Algorithm; Inductive Logic Programming; Book Price; Grammatical Inference
These keywords were added by machine and not by the authors.



Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Theodore W. Hong (1)
  • Keith L. Clark (1)
  1. Department of Computing, Imperial College of Science, Technology, and Medicine, London, UK
