Unsupervised Grammar Inference Using the Minimum Description Length Principle

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 7376)

Abstract

Context-free grammars (CFGs) are widely used in programming language descriptions, natural language processing, compilers, and other areas of software engineering that need to describe the syntactic structure of programs. Grammar inference (GI), the induction of CFGs from sample programs, is a challenging problem. We describe an unsupervised GI approach that uses simplicity as the criterion for directing the inference process and beam search for moving from a complex grammar to a simpler one. We use several operators to modify a grammar and use the Minimum Description Length (MDL) principle to favor simple and compact grammars. The effectiveness of this approach is shown by a case study of a domain-specific language. The experimental results show that an accurate grammar can be inferred in a reasonable amount of time.
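The interplay of an MDL score and beam search described in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm: the paper's operators and encoding are not given on this page, so the sketch assumes a single hypothetical "chunk" operator (replace a repeated pair of adjacent symbols with a new nonterminal) and a simple uniform-code description length. All names (`description_length`, `chunk_candidates`, `infer`, `expand`, the `N1`, `N2`, ... nonterminals) are illustrative assumptions, and sample tokens are assumed not to collide with `S` or `N<number>`.

```python
import math
from collections import Counter

START = "S"  # assumed start symbol


def description_length(grammar, n_samples):
    """Two-part MDL score in bits: L(grammar) + L(samples | grammar)."""
    symbols = set(grammar)
    for alternatives in grammar.values():
        for rhs in alternatives:
            symbols.update(rhs)
    # Uniform code over all symbols, plus one end-of-rule marker.
    bits_per_symbol = math.log2(len(symbols) + 1)
    grammar_bits = sum((len(rhs) + 1) * bits_per_symbol
                       for alternatives in grammar.values()
                       for rhs in alternatives)
    # Each sample is identified by choosing one alternative of START;
    # helper nonterminals have a single rule and cost no extra bits.
    data_bits = n_samples * math.log2(len(grammar[START]))
    return grammar_bits + data_bits


def replace_pair(rhs, pair, name):
    """Replace non-overlapping occurrences of `pair` in `rhs`, left to right."""
    out, i = [], 0
    while i < len(rhs):
        if rhs[i:i + 2] == pair:
            out.append(name)
            i += 2
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)


def chunk_candidates(grammar, limit):
    """Yield grammars obtained by chunking one of the most frequent pairs."""
    counts = Counter()
    for alternatives in grammar.values():
        for rhs in alternatives:
            counts.update(zip(rhs, rhs[1:]))
    for pair, count in counts.most_common(limit):
        if count < 2:
            break  # a pair seen once cannot shorten the grammar
        name = f"N{len(grammar)}"  # fresh nonterminal name (assumed unused)
        new = {nt: tuple(replace_pair(rhs, pair, name) for rhs in alts)
               for nt, alts in grammar.items()}
        new[name] = (pair,)
        yield new


def infer(samples, beam_width=3, max_steps=50):
    """Beam search from the trivial grammar toward a lower MDL score."""
    n = len(samples)

    def score(g):
        return description_length(g, n)

    # Initial (complex) grammar: one START alternative per distinct sample.
    best = {START: tuple(dict.fromkeys(tuple(s) for s in samples))}
    beam = [best]
    for _ in range(max_steps):
        candidates = [g for parent in beam
                      for g in chunk_candidates(parent, beam_width)]
        if not candidates:
            break
        candidates.sort(key=score)
        beam = candidates[:beam_width]
        if score(beam[0]) >= score(best):
            break  # no candidate is simpler than the best grammar so far
        best = beam[0]
    return best


def expand(grammar, rhs):
    """Expand a right-hand side back to its terminal sequence."""
    out = []
    for sym in rhs:
        if sym in grammar:
            out.extend(expand(grammar, grammar[sym][0]))
        else:
            out.append(sym)
    return tuple(out)


if __name__ == "__main__":
    programs = [
        "begin move left move left move left end".split(),
        "begin move left move left end".split(),
        "begin move left end".split(),
    ]
    for nonterminal, alternatives in infer(programs).items():
        print(nonterminal, "->", " | ".join(" ".join(r) for r in alternatives))
```

Because chunking is lossless, the grammar found here still generates exactly the sample set, and the data term of the score stays constant. A generalizing operator, such as merging two nonterminals, would be needed to infer a grammar that accepts unseen programs, and the data term would then penalize over-general grammars.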

Keywords

  • grammar inference
  • context free grammar
  • domain specific language
  • minimum description length
  • unsupervised learning

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sapkota, U., Bryant, B.R., Sprague, A. (2012). Unsupervised Grammar Inference Using the Minimum Description Length Principle. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2012. Lecture Notes in Computer Science, vol 7376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31537-4_12

  • DOI: https://doi.org/10.1007/978-3-642-31537-4_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31536-7

  • Online ISBN: 978-3-642-31537-4

  • eBook Packages: Computer Science, Computer Science (R0)