Foundations of Science

Volume 5, Issue 2, pp 185–207

Explanatory and Creative Alternatives to the MDL principle

  • José Hernández-Orallo
  • Ismael García-Varea


The Minimum Description Length (MDL) principle is the modern formalisation of Occam's razor. It has been used extensively and successfully in machine learning (ML), especially for noisy and long sources of data. However, the MDL principle presents some paradoxes and inconveniences. After discussing all of these, we address two of the most relevant: lack of explanation and lack of creativity. We present new alternatives to address these problems. The first one, intensional complexity, avoids extensional parts in a description, thereby distributing the compression ratio more evenly than the MDL principle does. The second one, information gain, requires the hypothesis to be informative (or computationally hard to discover) with respect to the evidence, thereby giving a formal definition of what it is to discover.
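As background for the abstract, the sketch below illustrates the generic two-part form of the MDL principle: choose the hypothesis H that minimises L(H) + L(D|H), the bits needed to state the hypothesis plus the bits needed to encode the data given it. It is a textbook-style illustration only, not the authors' intensional complexity or information gain measures. The Bernoulli model class, the fixed-precision encoding of the parameter, and the names `two_part_length` and `mdl_select` are all assumptions introduced here.

```python
import math

def two_part_length(data, p, precision_bits):
    """Total description length in bits: L(H) + L(D|H).

    L(H): bits spent stating the Bernoulli parameter p (assumed fixed precision).
    L(D|H): ideal code length of the data under Bernoulli(p).
    """
    model_bits = precision_bits
    data_bits = -sum(math.log2(p if x == 1 else 1.0 - p) for x in data)
    return model_bits + data_bits

def mdl_select(data, max_precision=6):
    """Return (total_bits, p, precision_bits) minimising the two-part length."""
    best = None
    for k in range(1, max_precision + 1):
        for j in range(1, 2 ** k):  # p = j / 2^k, excluding 0 and 1
            p = j / 2 ** k
            total = two_part_length(data, p, k)
            if best is None or total < best[0]:
                best = (total, p, k)
    return best

# Hypothetical evidence: 100 binary observations, 90 of them ones.
data = [1] * 90 + [0] * 10
total, p, k = mdl_select(data)
print(f"p={p}, precision={k} bits, total={total:.2f} bits (literal: {len(data)} bits)")
```

A finer parameter fits the data slightly better but costs more bits to state, so the minimum falls at a coarse value near the observed frequency. This trade-off between model cost and data cost is what the alternatives proposed in the paper are measured against.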

creativity, explanatory induction, informativeness, intensional complexity, machine learning, MDL principle, model evaluation, Occam's Razor, scientific and knowledge discovery





Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • José Hernández-Orallo (1)
  • Ismael García-Varea (2)
  1. Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València, València, Spain
  2. Institut Tecnològic d'Informàtica, Universitat Politècnica de València, València, Spain
