Evolving Culture Versus Local Minima

  • Yoshua Bengio
Part of the Studies in Computational Intelligence book series (SCI, volume 557)


We propose a theory that relates the difficulty of learning in deep architectures to culture and language. It is articulated around the following hypotheses: (1) learning in an individual human brain is hampered by the presence of effective local minima; (2) this optimization difficulty is particularly important when it comes to learning higher-level abstractions, i.e., concepts that cover a vast and highly nonlinear span of sensory configurations; (3) such high-level abstractions are best represented in brains by the composition of many levels of representation, i.e., by deep architectures; (4) a human brain can learn such high-level abstractions if guided by the signals produced by other humans, which act as hints or indirect supervision for these high-level abstractions; and (5) language, together with the recombination and optimization of mental concepts, provides an efficient evolutionary recombination operator, giving rise to rapid search in the space of communicable ideas that helps humans build better high-level internal representations of their world. Together, these hypotheses imply that human culture and the evolution of ideas have been crucial in countering an optimization difficulty that would otherwise make it very hard for human brains to capture high-level knowledge of the world. The theory is grounded in experimental observations of the difficulties of training deep artificial neural networks. Plausible consequences of this theory for the efficiency of cultural evolution are sketched.
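As a minimal, hypothetical illustration of hypotheses (1) and (4), not taken from the chapter itself: plain gradient descent on a toy nonconvex objective settles into whichever minimum its starting point favors, while a learner whose starting point is nudged toward a peer's better solution (a crude stand-in for hints from other humans) reaches the better basin. The objective function and all constants below are arbitrary choices for illustration.

```python
def f(x):
    """Toy nonconvex 'learning landscape' with two minima; the left one is lower."""
    return (x**2 - 1)**2 + 0.2 * x

def df(x):
    """Gradient of f."""
    return 4 * x * (x**2 - 1) + 0.2

def learn(x0, lr=0.01, steps=5000):
    """Plain gradient descent, standing in for an isolated individual learner."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x

# An unguided learner started in the poor basin stays there (x near +0.97).
solo = learn(1.5)
# Another learner happened to start in the better basin (x near -1.02).
teacher = learn(-1.5)
# "Cultural hint": initialize most of the way toward the teacher's solution,
# which places the guided learner inside the better basin before descending.
guided = learn(solo + 0.8 * (teacher - solo))
```

The hint here is deliberately indirect: the guided learner still runs the same local optimization, and only its starting point is influenced by a peer, which is enough to change which minimum it reaches.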


Keywords: Cultural Evolution · Synaptic Strength · Deep Neural Network · Stochastic Gradient Descent · Learning Agent



The author would like to thank Caglar Gulcehre, Aaron Courville, Myriam Côté, and Olivier Delalleau for useful feedback, as well as NSERC, CIFAR and the Canada Research Chairs for funding.



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

1. Yoshua Bengio, CIFAR Fellow, Department of Computer Science and Operations Research, University of Montréal, Montreal, Canada
