A good model of object shape is essential in applications such as segmentation, detection, inpainting and graphics. For example, when performing segmentation, local constraints on shape can help where object boundaries are noisy or unclear, and global constraints can resolve ambiguities where background clutter looks similar to parts of the objects. In general, the stronger the shape model, the greater the improvement in performance. In this paper, we use a type of deep Boltzmann machine (Salakhutdinov and Hinton, International Conference on Artificial Intelligence and Statistics, 2009) that we call a Shape Boltzmann Machine (SBM) to model foreground/background (binary) and parts-based (categorical) shape images. We show that the SBM is a strong model of shape: samples from the model look realistic, and the model generalizes to generate samples that differ from the training examples. We find that the SBM learns distributions that are qualitatively and quantitatively better than those of existing models for this task.
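The abstract describes the SBM as a type of deep Boltzmann machine (DBM) and judges it partly by the realism of its samples. As a rough illustration of how samples are drawn from a DBM in general, the sketch below runs block Gibbs sampling in a plain, fully connected, two-hidden-layer binary DBM. It is a hedged sketch under stated assumptions: it does not reproduce the SBM's specific architecture or any learned parameters, and the names `gibbs_sample_dbm`, `W1`, `W2`, `b`, `c1` and `c2` are hypothetical.

```python
import math
import random


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def _bernoulli(probs, rng):
    # Draw one binary value per unit from its "on" probability.
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_sample_dbm(W1, W2, b, c1, c2, n_steps=100, seed=0):
    """Block Gibbs sampling in a generic two-hidden-layer binary DBM.

    Hypothetical parameterization (not the SBM's architecture):
      W1[i][j]  weight between visible unit i and first hidden unit j
      W2[j][k]  weight between first hidden unit j and second hidden unit k
      b, c1, c2 biases of the visible, first and second hidden layers
    """
    rng = random.Random(seed)
    D, N1, N2 = len(b), len(c1), len(c2)
    # Random initial states for the visible and top hidden layers.
    v = [rng.randint(0, 1) for _ in range(D)]
    h2 = [rng.randint(0, 1) for _ in range(N2)]
    h1 = [0] * N1
    for _ in range(n_steps):
        # h1 receives input from both neighbouring layers, v and h2.
        p_h1 = [
            _sigmoid(c1[j]
                     + sum(W1[i][j] * v[i] for i in range(D))
                     + sum(W2[j][k] * h2[k] for k in range(N2)))
            for j in range(N1)
        ]
        h1 = _bernoulli(p_h1, rng)
        # Given h1, v and h2 are conditionally independent: sample both.
        p_v = [_sigmoid(b[i] + sum(W1[i][j] * h1[j] for j in range(N1)))
               for i in range(D)]
        p_h2 = [_sigmoid(c2[k] + sum(W2[j][k] * h1[j] for j in range(N1)))
                for k in range(N2)]
        v = _bernoulli(p_v, rng)
        h2 = _bernoulli(p_h2, rng)
    return v, h1, h2
```

Because a DBM has no connections within a layer, the odd layer (h1) and the even layers (v and h2) are each conditionally independent given the other, which is why each sweep can update whole layers at once rather than one unit at a time.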
We set \(S = 10{,}000\) in our experiments.
Ackley, D., Hinton, G., & Sejnowski, T. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), 147–169.
Alexe, B., Deselaers, T., & Ferrari, V. (2010a). ClassCut for unsupervised class segmentation. In European Conference on Computer Vision (pp. 380–393).
Alexe, B., Deselaers, T., & Ferrari, V. (2010b). What is an object? In IEEE Conference on Computer Vision and Pattern Recognition (pp. 73–80).
Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., & Davis, J. (2005). SCAPE: Shape completion and animation of people. ACM Transactions on Graphics (SIGGRAPH), 24(3), 408–416.
Bertozzi, A., Esedoglu, S., & Gillette, A. (2007). Inpainting of binary images using the Cahn–Hilliard equation. IEEE Transactions on Image Processing, 16(1), 285–291.
Bo, Y., & Fowlkes, C. (2011). Shape-based pedestrian parsing. In IEEE Conference on Computer Vision and Pattern Recognition.
Borenstein, E., Sharon, E., & Ullman, S. (2004). Combining top-down and bottom-up segmentation. In CVPR Workshop on Perceptual Organization in Computer Vision.
Boykov, Y., & Jolly, M. P. (2001). Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In International Conference on Computer Vision (pp. 105–112).
Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems (Vol. 2, pp. 211–217).
Cemgil, T., Zajdel, W., & Kröse, B. (2005). A hybrid graphical model for robust feature extraction from video. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1158–1165).
Chan, T. F., & Shen, J. (2001). Nontexture inpainting by curvature-driven diffusions. Journal of Visual Communication and Image Representation, 12(4), 436–449.
Chen, F., Yu, H., Hu, R., & Zeng, X. (2013). Deep learning shape priors for object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1870–1877).
Cootes, T., Taylor, C., Cooper, D. H., & Graham, J. (1995). Active shape models—Their training and application. Computer Vision and Image Understanding, 61, 38–59.
Desjardins, G., & Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision. Tech. Rep. 1327, Département d’Informatique et de Recherche Opérationnelle, Université de Montréal.
Eslami, S. M. A., & Williams, C. K. I. (2011). Factored shapes and appearances for parts-based object understanding. In British Machine Vision Conference (pp. 18.1–18.12).
Eslami, S. M. A., & Williams, C. K. I. (2012). A generative model for parts-based object segmentation. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, & K. Weinberger (Eds.), Advances in Neural Information Processing Systems (Vol. 25, pp. 100–107). Red Hook, NY: Curran Associates, Inc.
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Generative-Model Based Vision.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99, 1–19.
Freund, Y., & Haussler, D. (1994). Unsupervised learning of distributions on binary vectors using two layer networks. Tech. Rep. UCSC-CRL-94-25, University of California, Santa Cruz.
Frey, B., Jojic, N., & Kannan, A. (2003). Learning appearance and transparency manifolds of occluded objects in layers. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 45–52).
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Gavrila, D. M. (2007). A Bayesian, exemplar-based approach to hierarchical shape matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 1408–1421.
Harzallah, H., Jurie, F., & Schmid, C. (2009). Combining efficient object localization and image classification. In International Conference on Computer Vision.
Heess, N., Le Roux, N., & Winn, J. M. (2011). Weakly supervised learning of foreground-background segmentation using masked RBMs. In International Conference on Artificial Neural Networks (Vol. 2, pp. 9–16).
Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Jojic, N., & Caspi, Y. (2004). Capturing image structure with probabilistic index maps. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 212–219).
Jojic, N., Perina, A., Cristani, M., Murino, V., & Frey, B. (2009). Stel component analysis: Modeling spatial correlations in image class structure. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2044–2051).
Kapoor, A., & Winn, J. (2006). Located hidden random fields: Learning discriminative parts for object detection. In European Conference on Computer Vision (pp. 302–315).
Kohli, P., Kumar, M. P., & Torr, P. H. S. (2007). P³ & beyond: Solving energies with higher order cliques. In IEEE Conference on Computer Vision and Pattern Recognition.
Kohli, P., Ladicky, L., & Torr, P. H. S. (2009). Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3), 302–324.
Komodakis, N., & Paragios, N. (2009). Beyond pairwise energies: Efficient optimization for higher-order MRFs. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2985–2992).
Kumar, P., Torr, P., & Zisserman, A. (2005). OBJ CUT. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 18–25).
Lampert, C. H., Blaschko, M., & Hofmann, T. (2008). Beyond sliding windows: Object localization by efficient subwindow search. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8).
Le Roux, N., & Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631–1649.
Le Roux, N., Heess, N., Shotton, J., & Winn, J. (2011). Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3), 593–650.
Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning (pp. 609–616).
Morris, R. D., Descombes, X., & Zerubia, J. (1996). The Ising/Potts model is not well suited to segmentation tasks. In Proceedings of the IEEE Digital Signal Processing Workshop.
Murray, I., & Salakhutdinov, R. (2009). Evaluating probabilities under high-dimensional latent variable models. In Advances in Neural Information Processing Systems (Vol. 21).
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139.
Norouzi, M., Ranjbar, M., & Mori, G. (2009). Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2735–2742).
Nowozin, S., & Lampert, C. H. (2009). Global connectivity potentials for random field models. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 818–825).
Raina, R., Madhavan, A., & Ng, A. Y. (2009). Large-scale deep unsupervised learning using graphics processors. In International Conference on Machine Learning (pp. 873–880).
Ranzato, M., Mnih, V., & Hinton, G. E. (2010). How to generate realistic images using gated MRFs. In J. Lafferty, C. K. I. Williams, R. Zemel, J. Shawe-Taylor, & A. Culotta (Eds.), Advances in Neural Information Processing Systems (Vol. 23). Cambridge: MIT Press.
Ranzato, M., Susskind, J., Mnih, V., & Hinton, G. E. (2011). On deep generative models with applications to recognition. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2857–2864).
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
Roth, S., & Black, M. J. (2005). Fields of experts: A framework for learning image priors. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 860–867).
Rother, C., Kolmogorov, V., & Blake, A. (2004). “GrabCut”: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (SIGGRAPH), 23, 309–314.
Rother, C., Kohli, P., Feng, W., & Jia, J. (2009). Minimizing sparse higher order energy functions of discrete variables. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1382–1389).
Rowley, H., Baluja, S., & Kanade, T. (1998). Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 23–38.
Russell, B., Torralba, A., Murphy, K., & Freeman, W. (2008). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77, 157–173.
Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics (Vol. 5, pp. 448–455).
Salakhutdinov, R., & Murray, I. (2008). On the quantitative analysis of deep belief networks. In International Conference on Machine Learning.
Schneiderman, H. (2000). A statistical approach to 3D object detection applied to faces and cars. PhD Thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.
Shekhovtsov, A., Kohli, P., & Rother, C. (2012). Curvature prior for MRF-based segmentation and shape inpainting. In DAGM/OAGM Symposium (pp. 41–51).
Sigal, L., Balan, A., & Black, M. (2010). HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1–2), 4–27.
Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., & Van Gool, L. (2009). Using multi-view recognition and meta-data annotation to guide a robot’s attention. International Journal of Robotics Research, 28(8), 976–998.
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In International Conference on Machine Learning (pp. 1064–1071).
Tjelmeland, H., & Besag, J. (1998). Markov random fields with higher-order interactions. Scandinavian Journal of Statistics, 25(3), 415–433.
Williams, C. K. I., & Titsias, M. (2004). Greedy learning of multiple objects in images using robust statistics and factorial learning. Neural Computation, 16(5), 1039–1062.
Winn, J., & Jojic, N. (2005). LOCUS: Learning object classes with unsupervised segmentation. In International Conference on Computer Vision (pp. 756–763).
Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, 65, 177–228.
Younes, L., & Sud, P. (1989). Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82, 625–645.
The majority of this work was performed whilst AE and NH were at Microsoft Research in Cambridge. Thanks to Charless Fowlkes and Vittorio Ferrari for access to datasets, and to Pushmeet Kohli for valuable discussions. AE acknowledges funding from the Carnegie Trust, the SORSAS scheme, and the IST Programme of the European Community under the PASCAL2 Network of Excellence (IST-2007-216886). NH acknowledges funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under Grant agreement no. 270327, and from the Gatsby Charitable Foundation. Finally, we thank the anonymous referees for their comments, which helped improve the paper.
Cite this article
Eslami, S.M.A., Heess, N., Williams, C.K.I. et al. The Shape Boltzmann Machine: A Strong Model of Object Shape. Int J Comput Vis 107, 155–176 (2014). https://doi.org/10.1007/s11263-013-0669-1
- Deep Boltzmann machine