Using MDL for Grammar Induction
In this paper we study the application of the Minimum Description Length (MDL) principle (or two-part code optimization) to grammar induction in the light of recent developments in Kolmogorov complexity theory. We focus on issues that are important for the construction of effective compression algorithms. We define an independent measure for the quality of a theory given a data set: the randomness deficiency, a measure of how typical the data set is for the theory. It cannot be computed, but in many relevant cases it can be approximated. An optimal theory has minimal randomness deficiency. Using results from  and , we show that:
– Shorter codes do not necessarily lead to better theories. We prove that, in DFA induction, the randomness deficiency and the MDL code length can already diverge as a result of a single deterministic merge of two nodes.
– Contrary to what is suggested by the results of , there is no fundamental difference between positive and negative data from an MDL perspective.
– MDL is extremely sensitive to the correct calculation of code lengths: both the model code and the data-to-model code.
These results explain why applications of MDL to grammar induction have so far been disappointing. We show how the theoretical results can be deployed to create an effective algorithm for DFA induction. However, we believe that, since MDL is a global optimization criterion, MDL-based solutions will in many cases be less effective in problem domains where local optimization criteria can easily be calculated. The algorithms were tested on the Abbadingo problems (). The code was written in Java, using the Satin () divide-and-conquer system, which runs on top of the Ibis () Grid programming environment.
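To make the two-part code concrete, the sketch below (illustrative only; the helper names and the particular encoding are assumptions, not the paper's actual scheme) scores a DFA hypothesis over a binary alphabet: the model part encodes the transition table and accepting states, and the data-to-model part encodes the sample as an index among the strings the DFA accepts.

```python
import math

def count_accepted(delta, accepting, start, alphabet, max_len):
    """Count the strings of length <= max_len accepted by the DFA."""
    counts = {start: 1}                        # strings reaching each state
    total = 1 if start in accepting else 0     # the empty string
    for _ in range(max_len):
        nxt = {}
        for state, c in counts.items():
            for sym in alphabet:
                t = delta[(state, sym)]
                nxt[t] = nxt.get(t, 0) + c
        total += sum(c for s, c in nxt.items() if s in accepting)
        counts = nxt
    return total

def mdl_score(delta, accepting, start, alphabet, sample, max_len):
    """Two-part code length: model bits plus data-to-model bits."""
    states = {s for (s, _) in delta} | set(delta.values())
    n = len(states)
    # Model cost: one target-state index per (state, symbol) entry,
    # plus one accept/reject bit per state.
    model_bits = len(delta) * math.log2(max(n, 2)) + n
    # Data-to-model cost: index of the sample among all size-|sample|
    # subsets of the accepted strings of length <= max_len.
    a = count_accepted(delta, accepting, start, alphabet, max_len)
    if a < len(sample):
        return float('inf')    # the DFA cannot even generate the sample
    return model_bits + math.log2(math.comb(a, len(sample)))
```

For example, the two-state parity DFA (accepting binary strings with an even number of 1s) accepts 4 strings of length at most 2, so a two-string sample costs log2(C(4, 2)) ≈ 2.58 data bits on top of the 6 model bits. A merge that enlarges the accepted set shrinks the model part but inflates the binomial term, which is the trade-off the MDL criterion arbitrates.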
Keywords: Binary String, Regular Language, Kolmogorov Complexity, Short Code, Minimum Description Length Principle
- 2. Adriaans, P., Vitányi, P.M.B.: The power and perils of MDL. Human Computer Studies Lab, Universiteit van Amsterdam (2005)
- 9. Vervoort, M.: Games, walks and Grammars. Thesis, University of Amsterdam (2000)
- 11. van Zaanen, M., Adriaans, P.: Alignment-Based Learning versus EMILE: A Comparison. In: Proceedings of the Belgian-Dutch Conference on Artificial Intelligence (BNAIC), Amsterdam, The Netherlands, pp. 315–322 (2001)
- 13. Cornuéjols, A., Miclet, L.: Apprentissage artificiel, concepts et algorithmes. Eyrolles (2003)
- 16. de la Higuera, C., Adriaans, P.W., van Zaanen, M., Oncina, J. (eds.): Proceedings of the Workshop and Tutorial on Learning Context-Free Grammars held at the 14th European Conference on Machine Learning (ECML) and the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Dubrovnik, Croatia (2003)
- 17. van Nieuwpoort, R.V., Maassen, J., Kielmann, T., Bal, H.E.: Simple and Efficient Java-based Grid Programming. Scalable Computing: Practice and Experience 6(3), 19–32 (2005)