Learning (k,l)-Contextual Tree Languages for Information Extraction

  • Stefan Raeymaekers
  • Maurice Bruynooghe
  • Jan Van den Bussche
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)


This paper introduces a novel method for learning a wrapper for extraction of text nodes from web pages based upon (k,l)-contextual tree languages. It also introduces a method to learn good values of k and l based on a few positive and negative examples. Finally, it describes how the algorithm can be integrated in a tool for information extraction.


Keywords: Information Extraction, Target Node, Tree Automaton, Tree Language, Tree Transducer



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Stefan Raeymaekers (1)
  • Maurice Bruynooghe (1)
  • Jan Van den Bussche (2)
  1. Dept. of Computer Science, K.U.Leuven, Leuven
  2. Dept. Theoretical Computer Science, Universiteit Hasselt, Diepenbeek
