Machine Learning

, Volume 66, Issue 1, pp 33–67 | Cite as

Interactive learning of node selecting tree transducer

  • Julien Carme
  • Rémi Gilleron
  • Aurélien Lemay
  • Joachim Niehren


We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction.

We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that we introduce. We prove that deterministic NSTTs capture the class of queries definable in monadic second order logic (MSO) in trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction, and prove that monadic queries defined by NSTTs can be answered efficiently. We present a new polynomial time algorithm in RPNI-style that learns monadic queries defined by deterministic NSTTs from completely annotated examples, where all selected nodes are distinguished.

In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree pruning heuristics. We introduce pruning NSTTs—a formalism that shares many advantages of NSTTs. This leads us to an interactive learning algorithm for monadic queries defined by pruning NSTTs, which satisfies a new formal active learning model in the style of Angluin (1987).

We have implemented our interactive learning algorithm integrated it into a visually interactive Web information extraction system—called SQUIRREL—by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few user interactions during wrapper induction.


Web information extraction Wrapper induction Grammatical inference Tree automata Monadic queries 


  1. Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75(2), 87–106.zbMATHMathSciNetCrossRefGoogle Scholar
  2. Baumgartner, R., Flesca, S., & Gottlob, G. (2001). Visual web information extraction with lixto. In 28th International Conference on Very Large Data Bases, pp. 119–128.Google Scholar
  3. Brüggemann-Klein, A., Wood, D., & Murata, M. (2001). Regular tree and regular hedge languages over unranked alphabets: Version 1.Google Scholar
  4. Carme, J., Gilleron, R., Lemay, A., & Niehren, J. (2005). Interactive learning of node selecting tree transducer. In IJCAI Workshop on Grammatical Inference.Google Scholar
  5. Carme, J., Lemay, A., & Niehren, J. (2004a). Learning node selecting tree transducer from completely annotated examples. In 7th International Colloquium on Grammatical Inference, Vol. 3264 of Lecture Notes in Artificial Intelligence, (pp. 91–102). Springer Verlag.Google Scholar
  6. Carme, J., Niehren, J., & Tommasi, M. (2004b). Querying unranked trees with stepwise tree automata. In 19th International Conference on Rewriting Techniques and Applications, Vol. 3091 of Lecture Notes in Computer Science, (pp. 105–118). Springer Verlag.Google Scholar
  7. Chidlovskii, B. (2001). Wrapping web information providers by transducer induction. In Proc. European Conference on Machine Learning, Vol. 2167 of Lecture Notes in Artificial Intelligence, pp. 61–73.Google Scholar
  8. Cohen, W., Hurst, M., & Jensen, L. (2003). Web document analysis: challenges and opportunities, chap. A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. World Scientific.Google Scholar
  9. Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison, S., & Tommasi, M. (1997). Tree automata techniques and applications. Available on:
  10. Cristau, J., Löding, C., & Thomas, W. (2005). Deterministic automata on unranked trees. In 15th International Symposium on Fundamentals of Computation Theory. To Appear.Google Scholar
  11. de la Higuera, C. (1997). Characteristic sets for polynomial grammatical inference. Machine Learning, 27, 125–137.zbMATHCrossRefGoogle Scholar
  12. Drewes, F., & Hogberg, J. (2003). Learning a regular tree language from a teacher. In D.L.T. 2003, Vol. 2710 of Lecture Notes in Computer Science, (pp. 279–291).Google Scholar
  13. Freitag, D., & Kushmerick, N. (2000). Boosted wrapper induction. In AAAI/IAAI, (pp. 577–583).Google Scholar
  14. Freitag, D., & McCallum, A. K. (1999). Information extraction with hmms and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction.Google Scholar
  15. Frick, M., Grohe, M., & Koch, C. (2003). Query evaluation on compressed trees. In 18th IEEE Symposium on Logic in Computer Science, (pp. 188–197).Google Scholar
  16. Gold, E. (1967). Language identification in the limit. Inform. Control, 10, 447–474.zbMATHCrossRefGoogle Scholar
  17. Gold, E. (1978). Complexity of automaton identification from given data. Inform. Control, 37, 302–320.zbMATHMathSciNetCrossRefGoogle Scholar
  18. Gottlob, G., & Koch, C. (2002). Monadic queries over tree-structured data. In 17th Annual IEEE Symposium on Logic in Computer Science, (pp. 189–202) Copenhagen.Google Scholar
  19. Hsu, C.-N., & Dung, M.-T. (1998). Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8), 521–538.CrossRefGoogle Scholar
  20. Kermorvant, C., & de la Higuera, C. (2002). Learning language with help. In 6th International Colloquium on Grammatical Inference, Vol. 2484 of Lecture Notes in Artificial Intelligence, (pp. 161–173). Springer Verlag.Google Scholar
  21. Kosala, R., Bruynooghe, M., Van den Bussche, J., & Blockeel, H. (2003). Information extraction from web documents based on local unranked tree automaton inference. In 18th International Joint Conference on Artificial Intelligence, (pp. 403–408). Morgan Kaufmann.Google Scholar
  22. Kushmerick, N. (1997). Wrapper Induction for Information Extraction. Ph.D. thesis, University of Washington.Google Scholar
  23. Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2), 15–68.zbMATHMathSciNetCrossRefGoogle Scholar
  24. Kushmerick, N. (2002). Finite-state approaches to web information extraction. In Proc. 3rd Summer Convention on Information Extraction.Google Scholar
  25. Lang, K. J., Pearlmutter, B. A., & Price, R. A. (1998). Results of the abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. Lecture Notes in Computer Science, 1433, 1–12.Google Scholar
  26. Lang, K. (1992). Random DFA’s can be approximately learned from sparse uniform examples. In Proc. 5th Annu. Workshop on Comput. Learning Theory, (pp. 45–52). ACM Press, New York, NY.Google Scholar
  27. Libkin, L. (2005). Logics over unranked trees: An overview. In Automata, Languages and Programming: 32nd International Colloquium, No. 3580 in Lecture Notes in Computer Science, (pp. 35–50). Springer Verlag.Google Scholar
  28. Martens, W., & Niehren, J. (2006). On the minimization of XML schemas and tree automata for unranked trees. Journal of Computer and System Science. Special issue of DBPL’05.Google Scholar
  29. Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2), 93–114.CrossRefGoogle Scholar
  30. Neumann, A., & Seidl, H. (1998). Locating matches of tree patterns in forests. In Foundations of Software Technology and Theoretical Computer Science, (pp. 134–145).Google Scholar
  31. Neven, F., & Schwentick, T. (2002). Query automata over finite trees. Theoretical Computer Science, 275(1–2), 633–674.zbMATHMathSciNetCrossRefGoogle Scholar
  32. Neven, F., & Van Den Bussche, J. (2002). Expressiveness of structured document query languages based on attribute grammars. Journal of the ACM, 49(1), 56–100.MathSciNetCrossRefGoogle Scholar
  33. Niehren, J., Planque, L., Talbot, J.-M., & Tison, S. (2005). N-ary queries by tree automata. In 10th International Symposium on Database Programming Languages, Vol. 3774 of Lecture Notes in Computer Science, (pp. 217–231). Springer Verlag.Google Scholar
  34. Oncina, J., & García, P. (1992). Inferring regular languages in polynomial update time. In Pattern Recognition and Image Analysis, (pp. 49–61).Google Scholar
  35. Oncina, J., & García, P. (1993). Inference of recognizable tree sets. Tech. rep., Departamento de Sistemas Informáticos y Computación, Universidad de Alicante. DSIC-II/47/93.Google Scholar
  36. Raeymaekers, S., & Bruynooghe, M. (2004). Minimization of finite unranked tree automata. Manuscript.Google Scholar
  37. Raeymaekers, S., Bruynooghe, M., & Van den Bussche, J. (2005). Learning (k,l)-contextual tree languages for information extraction. In Proceedings of ECML’2005, Vol. 3720 of Lecture Notes in Artificial Intelligence, (pp. 305–316).Google Scholar
  38. Sakakibara, Y. (1990). Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science, 76, 223–242.zbMATHMathSciNetCrossRefGoogle Scholar
  39. Seidl, H. (1989). On the finite degree of ambiguity of finite tree automata. Acta Informatica, 26(6), 527–542.zbMATHMathSciNetCrossRefGoogle Scholar
  40. Thatcher, J. W. (1967). Characterizing derivation trees of context-free grammars through a generalization of automata theory. Journal of Computer and System Science, 1, 317–322.zbMATHMathSciNetGoogle Scholar
  41. Thatcher, J. W., & Wright, J. B. (1968). Generalized finite automata with an application to a decision problem of second-order logic. Mathematical System Theory, 2, 57–82.zbMATHMathSciNetCrossRefGoogle Scholar

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  • Julien Carme
    • 1
  • Rémi Gilleron
    • 1
  • Aurélien Lemay
    • 1
  • Joachim Niehren
    • 1
  1. 1.Mostrare projectINRIA Futurs and Lille University, LIFLLille

Personalised recommendations