Machine Learning, Volume 40, Issue 2, pp 111–137

Stochastic Grammatical Inference of Text Database Structure

  • Matthew Young-Lai
  • Frank Wm. Tompa


For a document collection in which structural elements are identified with markup, it is often necessary to construct a grammar retrospectively that constrains element nesting and ordering. This has been addressed by others as an application of grammatical inference. We describe an approach based on stochastic grammatical inference which scales more naturally to large data sets and produces models with richer semantics. We adopt an algorithm that produces stochastic finite automata and describe modifications that enable better interactive control of results. Our experimental evaluation uses four document collections with varying structure.
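As a rough illustration of the kind of input a stochastic finite automaton learner works from, the sketch below builds a frequency-annotated prefix tree from sequences of child elements and reads off empirical transition probabilities. State-merging methods typically start from such a tree and then merge states whose statistics are compatible; the merging step is not shown. This is not the authors' algorithm: the function names and the element names (`headword`, `sense`, `etymology`) are hypothetical and chosen only for the example.

```python
from collections import defaultdict


def build_frequency_prefix_tree(sequences):
    """Build a prefix tree that records how often each arc is traversed.

    Returns (transitions, counts, ends), all indexed by state number:
      transitions[s][symbol] -> next state
      counts[s][symbol]      -> number of sequences that followed that arc
      ends[s]                -> number of sequences that terminated in s
    """
    transitions = [{}]            # state 0 is the root
    counts = [defaultdict(int)]
    ends = [0]
    for seq in sequences:
        state = 0
        for symbol in seq:
            if symbol not in transitions[state]:
                # Allocate a fresh state for this previously unseen prefix.
                transitions.append({})
                counts.append(defaultdict(int))
                ends.append(0)
                transitions[state][symbol] = len(transitions) - 1
            counts[state][symbol] += 1
            state = transitions[state][symbol]
        ends[state] += 1
    return transitions, counts, ends


def transition_probabilities(counts, ends, state):
    """Empirical probability of each outgoing symbol, plus stopping (key None)."""
    total = sum(counts[state].values()) + ends[state]
    probs = {symbol: n / total for symbol, n in counts[state].items()}
    probs[None] = ends[state] / total
    return probs


# Hypothetical child-element sequences observed under one parent element.
samples = [
    ("headword", "sense"),
    ("headword", "sense", "sense"),
    ("headword", "etymology", "sense"),
]

transitions, counts, ends = build_frequency_prefix_tree(samples)
after_headword = transitions[0]["headword"]
print(transition_probabilities(counts, ends, after_headword))
```

The counts attached to each arc are what distinguish this from a non-stochastic prefix tree: they allow rarely used paths to be identified and give each generated sequence a probability.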

Keywords: stochastic grammatical inference, text database structure



Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Matthew Young-Lai, Computer Science Department, University of Waterloo, Waterloo, Canada
  • Frank Wm. Tompa, Computer Science Department, University of Waterloo, Waterloo, Canada
