Learning Sparse FRAME Models for Natural Image Patterns

International Journal of Computer Vision

Abstract

It is well known that natural images admit sparse representations by redundant dictionaries of basis functions such as Gabor-like wavelets. However, it is still an open question as to what the next layer of representational units above the layer of wavelets should be. We address this fundamental question by proposing a sparse FRAME (Filters, Random field, And Maximum Entropy) model for representing natural image patterns. Our sparse FRAME model is an inhomogeneous generalization of the original FRAME model. It is a non-stationary Markov random field model that reproduces the observed statistical properties of filter responses at a subset of selected locations, scales and orientations. Each sparse FRAME model is intended to represent an object pattern and can be considered a deformable template. The sparse FRAME model can be written as a shared sparse coding model, which motivates us to propose a two-stage algorithm for learning the model. The first stage selects the subset of wavelets from the dictionary by a shared matching pursuit algorithm. The second stage then estimates the parameters of the model given the selected wavelets. Our experiments show that the sparse FRAME models are capable of representing a wide variety of object patterns in natural images and that the learned models are useful for object classification.


References

  • Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.

  • Adler, A., Elad, M., & Hel-Or, Y. (2013). Probabilistic Subspace Clustering via Sparse Representations. IEEE Signal Processing Letters, 20, 63–66.

  • Aharon, M., Elad, M., & Bruckstein, A. M. (2006). The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54, 4311–4322.

  • Bengio, Y., Courville, A. C., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on PAMI, 35, 1798–1828.

  • Bruckstein, A. M., Donoho, D. L., & Elad, M. (2009). From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51, 34–81.

  • Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129–159.

  • Chen, J., & Huo, X. (2005). Sparse representations for multiple measurement vectors (MMV) in an overcomplete dictionary. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 257–260.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886-893.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

  • Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195, 216–222.

  • Elad, M. (2010). Sparse and redundant representations: From theory to applications in signal and image processing. Berlin: Springer.

  • Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions Image Processing, 15, 3736–3745.

  • Elad, M., Milanfar, P., & Rubinstein, R. (2007). Analysis versus synthesis in signal priors. Inverse problems, 23(3), 947.

  • Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In: Proceedings of the Computer Vision and Pattern Recognition Workshops.

  • Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on PAMI, 32, 1627–1645.

  • Ferrari, V., Jurie, F., & Schmid, C. (2010). From images to shape models for object detection. International Journal of Computer Vision, 87, 284–303.

  • Fidler, S., Boben, M., & Leonardis, A. (2008). Similarity-based cross-layered hierarchical representation for object categorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.

  • Gelman, A., & Meng, X. L. (1998). Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science, 13, 163–185.

  • Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on PAMI, 6, 721–741.

  • Geman, S., Potter, D. F., & Chi, Z. (2002). Composition systems. Quarterly of Applied Mathematics, 60, 707–736.

  • Gong, B., Shi, Y., Sha, F. & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: an unsupervised approach. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Technical report, Caltech.

  • Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

  • Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

  • Hoffman, J., Rodner, E., Donahue, J., Saenko, K., & Darrell, T. (2013). Efficient learning of domain-invariant image representations. In: Proceedings of the International Conference of Learning Representations.

  • Hong, Y., Si, Z., Hu, W., Zhu, S. C., & Wu, Y. N. (2013). Unsupervised learning of compositional sparse code for natural image representation. Quarterly of Applied Mathematics, 72, 373–406.

  • Jhuo, I.-H., Liu, D., Lee, D. T., & Chang, S. (2012). Robust visual domain adaptation with low-rank reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th Annual International Conference on Machine Learning.

  • Liu, C., Zhu, S.-C., & Shum, H.-Y. (2001). Learning inhomogeneous Gibbs model of faces by minimax entropy. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 281–287.

  • Lounici, K., Tsybakov, A. B., Pontil, M., & van de Geer, S. A. (2009). Taking advantage of sparsity in multi-task learning. In: Proceedings of the 22nd Conference on Learning Theory.

  • Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.

  • Mallat, S., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 3397–3415.

  • Marszalek, M., & Schmid, C. (2007). Accurate object localization with shape masks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Nam, S., Davies, M. E., Elad, M., & Gribonval, R. (2013). The cosparse analysis model and algorithms. Applied and Computational Harmonic Analysis, 34, 30–56.

  • Neal, R. (2001). Annealed importance sampling. Statistics and Computing, 11, 125–139.

  • Neal, R. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo.

  • Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39, 1–47.

  • Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

  • Pati, Y. C., Rezaiifar, R., & Krishnaprasad, P. S. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In: Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, pp. 40–44.

  • Pietra, S. D., Pietra, V. D., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on PAMI, 19, 380–393.

  • Ranzato, M., & Hinton, G. E. (2010). Modeling pixel means and covariances using factorized third-order Boltzmann machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.

  • Roth, S., & Black, M. (2009). Fields of experts. International Journal of Computer Vision, 82, 205–229.

  • Rubinstein, R., Zibulevsky, M., & Elad, M. (2010). Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58, 1553–1564.

  • Saenko, K., Kulis, B., Fritz, M. & Darrell, T. (2010). Adapting visual category models to new domains. In: Proceedings of the European Conference on Computer Vision (ECCV).

  • Shekhar, S., Patel, V. M., Nguyen, H. V., & Chellappa, R. (2013). Generalized domain adaptive dictionaries. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Si, Z., & Zhu, S. C. (2012). Learning hybrid image template (HIT) by information projection. IEEE Transactions on PAMI, 34, 1354–1367.

  • Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (pp. 194–281). Cambridge: MIT Press.

  • Teh, Y. W., Welling, M., Osindero, S., & Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B, 58, 267–288.

  • Tropp, J., Gilbert, A., & Strauss, M. (2006). Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Processing, 86, 572–588.

  • Tuytelaars, T., Lampert, C. H., Blaschko, M. B., & Buntine, W. (2009). Unsupervised object discovery: A comparison. International Journal of Computer Vision, 88(2), 284-302.

  • Vapnik, V. N. (2000). The nature of statistical learning theory. Berlin: Springer.

  • Welling, M., Hinton, G. E., & Osindero, S. (2003). Learning sparse topographic representations with products of student-t distributions. In: Proceedings of Advances in Neural Information Processing Systems (NIPS).

  • Wu, Y. N., Si, Z., Gong, H., & Zhu, S. C. (2010). Learning active basis model for object detection and recognition. International Journal of Computer Vision, 90, 198–235.

  • Xie, J., Hu, W., Zhu, S. C., & Wu, Y. N. (2014). Learning Inhomogeneous FRAME models for object patterns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Yang, M., Zhang, L., Feng, X., & Zhang, D. (2011). Fisher discrimination dictionary learning for sparse representation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 543-550.

  • Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, 65, 177–228.

  • Zeiler, M., Taylor, G., & Fergus, R. (2011). Adaptive deconvolutional networks for mid and high level feature learning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Zhu, L., Lin, C., Huang, H., Chen, Y., & Yuille, A. (2008). Unsupervised structure learning: hierarchical recursive composition, suspicious coincidence and competitive exclusion. In: Proceedings of the European Conference on Computer Vision (ECCV).

  • Zhu, S. C., & Mumford, D. B. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2, 259–362.

  • Zhu, S. C., Wu, Y. N., & Mumford, D. B. (1998). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Acknowledgments

This work was supported by NSF DMS 1310391, NSF IIS 1423305, ONR MURI N00014-10-1-0933, and DARPA MSEE FA8650-11-1-7149. We thank the three reviewers for their insightful comments and valuable suggestions, which have helped us improve both the presentation and the content of this paper. We are grateful to one reviewer for sharing insights on analysis prior models. Thanks also go to an editor of the special issue for helpful suggestions. We thank Adrian Barbu for discussions.

Author information

Corresponding author

Correspondence to Ying Nian Wu.

Additional information

Communicated by Julien Mairal, Francis Bach, and Michael Elad.

Appendices

Appendix: Simulation by Hamiltonian Monte Carlo

To approximate \(\mathrm{E}_{p(\mathbf{I};\lambda ^{(t)})}[|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |]\) in Eq. (9), we need to draw a synthesized sample set \(\{\tilde{\mathbf{I}}_m\}\) from \(p(\mathbf{I};\lambda ^{(t)})\) by HMC (Duane et al. 1987). We can write \(p(\mathbf{I}; \lambda )\) as \(p(\mathbf{I}) \propto \exp (-U(\mathbf{I}))\), where \(\mathbf{I}\in R^{|\mathcal{D}|}\) and

$$\begin{aligned} U(\mathbf{I})=-\sum _{x, s, \alpha } \lambda _{x, s, \alpha } \big | \langle \mathbf{I}, B_{x, s, \alpha } \rangle \big |+ \frac{1}{2} |\mathbf{I}|^2 \end{aligned}$$
(33)

(assuming \(\sigma ^2 = 1\)). In the language of physics, \(\mathbf{I}\) can be regarded as a position vector and \(U(\mathbf{I})\) as the potential energy function. To allow Hamiltonian dynamics to operate, we introduce an auxiliary momentum vector \({\varvec{\phi }}\in R^{|\mathcal{D}|}\) and the corresponding kinetic energy function \(K({\varvec{\phi }})=|{\varvec{\phi }}|^2/2m\), where \(m\) represents the mass. This defines a fictitious physical system with canonical coordinates \((\mathbf{I},{\varvec{\phi }})\) and total energy \(H(\mathbf{I},{\varvec{\phi }})=U(\mathbf{I})+K({\varvec{\phi }})\). Instead of sampling from \(p(\mathbf{I})\) directly, HMC samples from the joint canonical distribution \(p(\mathbf{I},{\varvec{\phi }}) \propto \exp (-H(\mathbf{I},{\varvec{\phi }}))\), under which \(\mathbf{I}\sim p(\mathbf{I})\) marginally and \({\varvec{\phi }}\) follows a Gaussian distribution independent of \(\mathbf{I}\). In each iteration, HMC draws a random sample from the marginal Gaussian distribution of \({\varvec{\phi }}\) and then evolves \((\mathbf{I},{\varvec{\phi }})\) according to the Hamiltonian dynamics, which conserves the total energy.

In a practical implementation, the leapfrog algorithm is used to discretize the continuous Hamiltonian dynamics as follows, with \(\epsilon \) being the step size:

$$\begin{aligned}&{\varvec{\phi }}^{(t+\epsilon /2)}={\varvec{\phi }}^{(t)}-\big (\epsilon /2\big )\frac{\partial U}{\partial \mathbf{I}}\big (\mathbf{I}^{(t)}\big ), \end{aligned}$$
(34)
$$\begin{aligned}&\mathbf{I}^{(t+\epsilon )}= \mathbf{I}^{(t)} + \epsilon \frac{{\varvec{\phi }}^{(t+\epsilon /2)}}{m}, \end{aligned}$$
(35)
$$\begin{aligned}&{\varvec{\phi }}^{(t+\epsilon )}={\varvec{\phi }}^{(t+\epsilon /2)}-(\epsilon /2)\frac{\partial U}{\partial \mathbf{I}}\big (\mathbf{I}^{(t+\epsilon )}\big ), \end{aligned}$$
(36)

that is, a half-step update of \({\varvec{\phi }}\) is performed first and then it is used to compute \(\mathbf{I}^{(t + \epsilon )}\) and \({\varvec{\phi }}^{(t + \epsilon )}\).

A key step in the leapfrog algorithm is the computation of the derivative of the potential energy function

$$\begin{aligned} \frac{\partial U}{ \partial \mathbf{I}}=-\sum _{x, s, \alpha } \lambda _{x, s, \alpha } \text{ sign }\big ( \langle \mathbf{I}, B_{x, s, \alpha } \rangle \big )B_{x, s, \alpha }+ \mathbf{I}, \end{aligned}$$
(37)

where the map of responses \(r_{x, s, \alpha } = \langle \mathbf{I}, B_{x, s, \alpha } \rangle \) is computed by bottom-up convolution of the filter corresponding to \((s, \alpha )\) with \(\mathbf{I}\) for each \((s, \alpha )\). Then the derivative is computed by top-down linear superposition of the basis functions: \(-\sum _{x, s, \alpha } \lambda _{x, s, \alpha } \text{ sign }( r_{x, s, \alpha } )B_{x, s, \alpha } + \mathbf{I}\), which can again be computed by convolution. Both bottom-up and top-down convolutions can be carried out efficiently by GPUs.
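
As a concrete illustration, the following is a minimal Python/NumPy sketch of this gradient computation, not the authors' GPU implementation. It assumes a hypothetical list `filters` of 2-D filters indexed by \((s, \alpha )\), a matching list `lam_maps` of weight maps \(\lambda _{x, s, \alpha }\) (zero at unselected positions), and \(\sigma ^2 = 1\) as in Eq. (33).

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

def grad_U(I, filters, lam_maps):
    """Gradient of the potential energy U(I) as in Eq. (37).

    Sketch only. Assumptions:
      - filters[k] is the 2-D filter for the k-th (scale, orientation) pair;
      - lam_maps[k] is a 2-D map of the weights lambda_{x, s, alpha} for that
        filter, zero at unselected locations, with the same shape as I;
      - sigma^2 = 1, as in Eq. (33).
    """
    g = I.copy()                                   # the "+ I" term from the Gaussian reference q
    for F, lam in zip(filters, lam_maps):
        # bottom-up: response map r_{x, s, alpha} = <I, B_{x, s, alpha}> (correlation with the filter)
        r = correlate2d(I, F, mode='same', boundary='fill')
        # top-down: superpose lambda * sign(r) copies of the basis function (convolution)
        g -= convolve2d(lam * np.sign(r), F, mode='same', boundary='fill')
    return g
```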

The discretization of the leapfrog algorithm cannot keep \(H(\mathbf{I}, {\varvec{\phi }})\) exactly constant, so a Metropolis acceptance/rejection step is used to correct the discretization error. Starting from the current state \((\mathbf{I},{\varvec{\phi }})\), the proposed state \((\mathbf{I}^ \star ,{\varvec{\phi }}^ \star )\) obtained after \(L\) leapfrog steps is accepted as the next state of the Markov chain with probability \( \min [1, \exp (-H(\mathbf{I}^ \star ,{\varvec{\phi }}^ \star )+H(\mathbf{I},{\varvec{\phi }})) ]. \) If it is not accepted, the next state is the same as the current state.

In summary, a complete description of the HMC sampler for the inhomogeneous FRAME model is as follows:

  1. (i)

    Generate the momentum vector \({\varvec{\phi }}\) from its marginal distribution \(p({\varvec{\phi }}) \propto \exp (-K({\varvec{\phi }}))\), which is the zero-mean Gaussian distribution with covariance matrix \(m I\) (\(I\) is the identity matrix).

  2. (ii)

    Perform \(L\) leapfrog steps to reach the new state \((\mathbf{I}^{\star },{\varvec{\phi }}^{\star }).\)

  3. (iii)

    Perform acceptance/rejection of the proposed state \((\mathbf{I}^{\star },{\varvec{\phi }}^{\star }).\)

\(L\), \(\epsilon \), and \(m\) are parameters of the algorithm, which need to be tuned to obtain good performance.
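
The Python sketch below puts steps (i)–(iii) together into one HMC transition. It is illustrative only: `U` and `grad_U` stand for the potential energy and its gradient (e.g., the `grad_U` sketched above), and the tuning parameters \(\epsilon \), \(L\), and \(m\) are passed in by the caller.

```python
import numpy as np

def hmc_step(I, U, grad_U, eps, L, m, rng=None):
    """One HMC transition for p(I) proportional to exp(-U(I)): steps (i)-(iii).

    Sketch under the assumptions above; U and grad_U are callables for the
    potential energy and its gradient, and eps, L, m are the tuning parameters.
    """
    rng = rng or np.random.default_rng()

    # (i) draw the momentum from its marginal N(0, m * Id) distribution
    phi = rng.normal(scale=np.sqrt(m), size=I.shape)

    I_new, phi_new = I.copy(), phi.copy()

    # (ii) L leapfrog steps, Eqs. (34)-(36)
    phi_new -= 0.5 * eps * grad_U(I_new)              # initial half-step for the momentum
    for step in range(L):
        I_new += eps * phi_new / m                    # full step for the position
        g = grad_U(I_new)
        phi_new -= eps * g if step < L - 1 else 0.5 * eps * g   # last momentum update is a half-step

    # (iii) Metropolis acceptance/rejection to correct the discretization error
    H_cur = U(I) + np.sum(phi ** 2) / (2.0 * m)
    H_new = U(I_new) + np.sum(phi_new ** 2) / (2.0 * m)
    return I_new if rng.random() < np.exp(min(0.0, H_cur - H_new)) else I
```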

Maximum Entropy Justification

The inhomogeneous FRAME model can be justified by the maximum entropy principle. Suppose the true distribution that generates the observed images \(\{\mathbf{I}_m\}\) is \(f(\mathbf{I})\). Let \(\lambda ^{\star }\) solve the population version of the maximum likelihood equation:

$$\begin{aligned} \mathrm{E}_{p(\mathbf{I}; \lambda )}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ] = \mathrm{E}_{f}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ], \quad \forall x, s, \alpha . \end{aligned}$$
(38)

Let \(\varOmega \) be the set of all the distributions \(p(\mathbf{I})\) such that

$$\begin{aligned} \mathrm{E}_{p}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ] = \mathrm{E}_{f}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ], \quad \forall x, s, \alpha . \end{aligned}$$
(39)

Then \(f \in \varOmega \). Let \(\Lambda \) be the set of all the distributions \(\{p_{\lambda }, \forall \lambda \}\), where \(p_{\lambda }(\mathbf{I}) = p(\mathbf{I}; \lambda )\). Then \(q \in \Lambda \) since \(q(\mathbf{I}) = p(\mathbf{I}; \lambda = 0)\). Thus \(p_{\lambda ^\star }\) is the intersection between \(\Lambda \) and \(\varOmega \). In Fig. 17, \(\Lambda \) and \(\varOmega \) are illustrated by blue and green curves respectively, where each point on the curves is a probability distribution. The two curves \(\Lambda \) and \(\varOmega \) are “orthogonal” in the sense that for any \(p_{\lambda } \in \Lambda \) and for any \(p \in \varOmega \), it can be easily proved that the Pythagorean property

$$\begin{aligned} \mathrm{KL}\big (p || p_{\lambda }\big ) = \mathrm{KL}\big (p || p_{\lambda ^{\star }}\big ) + \mathrm{KL}\big (p_{\lambda ^{\star }}||p_{\lambda }\big ) \end{aligned}$$
(40)

holds (Pietra et al. 1997), where \(\mathrm{KL}(p||q)\) is the Kullback-Leibler divergence from \(p\) to \(q\). This Pythagorean property leads to the following dual properties of \(p_{\lambda ^{\star }}\):

  1. (1)

    Maximum likelihood: Among all \(p_{\lambda } \in \Lambda \), \(p_{\lambda ^{\star }}\) achieves the minimum of \(\mathrm{KL}(f||p_{\lambda })\).

  2. (2)

    Maximum entropy or minimum divergence: Among all \(p \in {\varOmega }\), \(p_{\lambda ^{\star }}\) achieves the minimum of \(\mathrm{KL}(p||q)\). Thus \(p_{\lambda ^{\star }}\) can be considered the minimal modification of the reference distribution \(q\) to match the statistical properties of the true distribution \(f\).
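
The identity (40) can be verified directly from the log-linear form of \(p_{\lambda }\). Writing \(H(\mathbf{I})\) for the vector of features \((|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |, \forall x, s, \alpha )\), for any \(p \in \varOmega \) and any \(p_{\lambda } \in \Lambda \),

$$\begin{aligned} \mathrm{KL}\big (p || p_{\lambda }\big ) - \mathrm{KL}\big (p || p_{\lambda ^{\star }}\big )&= \mathrm{E}_{p}\big [\log p_{\lambda ^{\star }}(\mathbf{I}) - \log p_{\lambda }(\mathbf{I})\big ] = \big \langle \lambda ^{\star } - \lambda , \mathrm{E}_{p}\big [H(\mathbf{I})\big ]\big \rangle - \log Z\big (\lambda ^{\star }\big ) + \log Z\big (\lambda \big ) \\&= \big \langle \lambda ^{\star } - \lambda , \mathrm{E}_{p_{\lambda ^{\star }}}\big [H(\mathbf{I})\big ]\big \rangle - \log Z\big (\lambda ^{\star }\big ) + \log Z\big (\lambda \big ) = \mathrm{KL}\big (p_{\lambda ^{\star }} || p_{\lambda }\big ), \end{aligned}$$

where the third equality holds because \(\mathrm{E}_{p}[H(\mathbf{I})] = \mathrm{E}_{f}[H(\mathbf{I})] = \mathrm{E}_{p_{\lambda ^{\star }}}[H(\mathbf{I})]\) by (39) and (38).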

The above justification also applies to the sparse FRAME model.

For sparsification, in principle we could select \(B_{x_i, s_i, \alpha _i}\) sequentially using a procedure like projection pursuit (Friedman 1987) or filter pursuit (Zhu et al. 1998). Suppose we have selected \(k\) basis functions \((B_{x_i, s_i, \alpha _i}, i = 1, \ldots , k)\), and let \(p_k\) be the fitted model with the corresponding \(\lambda = (\lambda _i, i = 1, \ldots , k)\) estimated by MLE. Suppose we are to select the next basis function \(B_{x_{k+1}, s_{k+1}, \alpha _{k+1}}\), and let \(p_{k+1}\) be the resulting fitted model. We then want to minimize \(\mathrm{KL}(f||p_{k+1}) = \mathrm{KL}(f||p_{k}) - \mathrm{KL}(p_{k+1}||p_k)\), that is, to maximize \(\mathrm{KL}(p_{k+1}||p_k)\), which serves as the pursuit index. The problem with such a procedure is that fitting each \(p_k\) requires MCMC computation, and the pursuit index itself is also difficult to compute. So we pursue a different approach that exploits the connection between the sparse FRAME model and shared sparse coding.

Fig. 17 Illustration of the maximum entropy principle. Each curve represents a set of probability distributions. \(\varOmega \) is the set of distributions that reproduce the statistical properties of filter responses under the true distribution \(f\). \(\Lambda \) is the set of distributions defined by the model. The two curves are orthogonal to each other in the sense of the Pythagorean property of the Kullback-Leibler divergences, so \(p_{\lambda ^{\star }}\) can be considered the minimal modification of the reference distribution \(q\) to match the statistical properties of \(f\).

Sparse FRAME and Shared Sparse Coding

From sparse FRAME to shared sparse coding Let us assume that the reference distribution \(q(\mathbf{I})\) in the sparse FRAME model (15) is a Gaussian white noise model so that the pixel intensities follow \(\mathrm{N}(0, \sigma ^2)\) independently. For sparse FRAME, it is natural to assume that the number of selected basis functions \(n\) is much less than the number of pixels in \(\mathbf{I}\), i.e., \(n \ll |\mathcal{D}|\), where \(\mathcal{D}\) is the image domain. For notational convenience, we can make \(\mathbf{I}\) and \(B_i = B_{x_i, s_i, \alpha _i}\), \(i = 1, \ldots , n\) into \(|\mathcal{D}|\)-dimensional vectors, and let \(\mathbf{B}= (B_1, \ldots , B_n)\) be the resulting \(|\mathcal{D}| \times n\) matrix.

The connection between sparse FRAME and shared sparse coding is most evident if we temporarily assume that the selected basis functions \((B_i, i = 1, \ldots , n)\) are orthogonal (with unit \(\ell _2\) norm as assumed before). Extension to non-orthogonal \(\mathbf{B}\) is straightforward but requires tedious notation (such as \((\mathbf{B}^{T}\mathbf{B})^{-1}\)). For \(\mathbf{B}\), we can construct \(\bar{n} = |\mathcal{D}| - n\) basis vectors of unit norm \(\bar{B}_1, \ldots , \bar{B}_{\bar{n}}\) that are orthogonal to each other and that are also orthogonal to \((B_i, i = 1, \ldots , n)\). Thus each image \(\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \sum _{i=1}^{\bar{n}} \bar{r}_i \bar{B}_i\), where \(r_i = \langle \mathbf{I}, B_i \rangle \), and \(\bar{r}_i = \langle \mathbf{I}, \bar{B}_i\rangle \). So we have the linear additive model \(\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \epsilon \), with \(\epsilon = \sum _{i=1}^{\bar{n}} \bar{r}_i \bar{B}_i\) being the least squares residual image.
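
A small numerical check of this decomposition, with a random orthonormal \(\mathbf{B}\) standing in for the selected, vectorized wavelets (a sketch for illustration only, not part of the original experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 64, 5                                   # |D| pixels, n selected basis functions

B = np.linalg.qr(rng.normal(size=(D, n)))[0]   # random orthonormal stand-in for the wavelets
I = rng.normal(size=D)                         # a vectorized image

R = B.T @ I                                    # responses r_i = <I, B_i>
eps = I - B @ R                                # least squares residual image

assert np.allclose(B.T @ eps, 0.0)             # eps lies in the span of B_bar (orthogonal to B)
assert np.allclose(B @ R + eps, I)             # I = sum_i r_i B_i + eps
```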

Under the Gaussian white noise \(q(\mathbf{I})\), \(r_i\) and \(\bar{r}_i\) are all independent \(\mathrm{N}(0, \sigma ^2)\) random variables because of the orthogonality of \((\mathbf{B}, \bar{\mathbf{B}})\). Let \(R\) be the column vector whose elements are \(r_i\), and \(\bar{R}\) be the column vector whose elements are \(\bar{r}_i\). Then under the sparse FRAME model (15), only the distribution of \(R\) is modified during the change from \(q(\mathbf{I})\) to \(p(\mathbf{I}; \mathbf{B}, \lambda )\), which changes the distribution of \(R\) from Gaussian white noise \(q(R)\) to

$$\begin{aligned} p(R; \lambda ) = \frac{1}{Z(\lambda )}\exp \left( \sum _{i=1}^{n} \lambda _i |r_i|\right) q(R), \end{aligned}$$
(41)

while the distribution of the residual coordinates \(\bar{R}\) remains Gaussian white noise, and \(R\) and \(\bar{R}\) remain independent. That is, \(p(R, \bar{R}; \lambda ) = p(R; \lambda ) q(\bar{R}) \).

Thus the sparse FRAME model implies a linear additive model \(\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \epsilon \), where \(R \sim p(R; \lambda )\) and \(\epsilon \) is a Gaussian white noise in the \(\bar{n}\)-dimensional residual space, and \(\epsilon \) is independent of \(R\). If we observe independent training images \(\{\mathbf{I}_m, m = 1, \ldots , M\}\) from the model, then \(\mathbf{I}_m = \sum _{i=1}^{n} r_{m, i} B_i + \epsilon _m\), i.e., \(\{\mathbf{I}_m\}\) share a common set of basis functions \(\mathbf{B}= (B_i, i = 1, \ldots , n)\) that provide sparse coding for multiple images simultaneously.

From shared sparse coding to sparse FRAME Conversely, suppose we are given a shared sparse coding model of the form \(\mathbf{I}=\sum _{i=1}^{n} c_i B_i + \epsilon = \mathbf{B}C + \epsilon \), where \(C\) is a column vector whose components are \(c_i\). Assume \(C \sim p(C)\) and \(\epsilon \sim \mathrm{N}(0, I \sigma ^2)\), where \(I\) is the \(|\mathcal{D}|\)-dimensional identity matrix, and \(\epsilon \) and \(C\) are independent. Let \(\delta = \mathbf{B}^T \epsilon \), whose components \(\delta _i = \langle \epsilon , B_i\rangle \) are independent \(\mathrm{N}(0, \sigma ^2)\) random variables. Then we can write \(\mathbf{I}= \mathbf{B}R + \bar{\mathbf{B}}\bar{R}\), where \(R = C + \delta \), and \(\bar{\epsilon } = \bar{\mathbf{B}}\bar{R}\) is the projection of \(\epsilon \) onto the space spanned by \(\bar{\mathbf{B}}\). Let \(\tilde{p}(R)\) be the density of \(R = C+ \delta \), which is obtained by convolving \(p(C)\) with the Gaussian white noise density. Then \(p(\mathbf{I}) = \tilde{p}(R) q(\bar{R}) = q(\mathbf{I}) \tilde{p}(R) /q(R)\), since \(q(\mathbf{I}) = q(R)q(\bar{R})\) under the Gaussian white noise model (\(d\mathbf{I}= dR d\bar{R}\) under orthogonality, so there is no Jacobian term). If we choose to model \(\tilde{p}(R)/q(R) = \exp \left( \sum _{i=1}^{n} \lambda _i |r_i|\right) /Z(\lambda )\), we arrive at the sparse FRAME model.

Selection of basis functions For orthogonal \(\mathbf{B}\), as shown above, the probability density \(p(\mathbf{I}; \mathbf{B}, \lambda ) = q(\bar{R}) p(R; \lambda ) = q(\bar{R}) q(R) \exp \left( \sum _{i=1}^{n} \lambda _i |r_i|\right) /Z(\lambda )\). Given a set of training images \(\{\mathbf{I}_m, m = 1, \ldots , M\}\), and for a candidate set of basis functions \(\mathbf{B}= (B_i, i= 1, \ldots , n)\), we can estimate \(\lambda = (\lambda _i, i = 1, \ldots , n)\) by MLE, giving us \(\lambda ^{\star }\), and the resulting log-likelihood is

$$\begin{aligned}&\sum _{m=1}^{M} \log p\big (\mathbf{I}_m; \mathbf{B}, \lambda ^{\star }\big ) \nonumber \\&\quad = \sum _{m=1}^{M} \left[ \log q\big (\bar{R}_m\big ) + \log p\big (R_m; \lambda ^{\star }\big )\right] \end{aligned}$$
(42)
$$\begin{aligned}&=- \frac{1}{2\sigma ^2} \sum _{m=1}^{M} ||\mathbf{I}_m - \mathbf{B}R_m||^2 - \frac{M \bar{n}}{2} \log \big (2\pi \sigma ^2\big )\end{aligned}$$
(43)
$$\begin{aligned}&\quad + \sum _{m=1}^{M} \log p\big (R_m; \lambda ^{\star }\big ). \end{aligned}$$
(44)

Suppose we are to choose a \(\mathbf{B}\) from a collection of candidates. Ideally we should maximize the sum of (43) and (44). We may interpret (43) as the negative coding length of the residual images \(\epsilon _m\) under the Gaussian white noise model, and (44) as the negative coding length of the coefficients \(R_m\) under the fitted model \(p(R; \lambda ^{\star })\). If \(\sigma ^2\) is small, (43) is the dominant term, while the coding lengths of \(R_m\) for different \(\mathbf{B}\) may not differ much in comparison. So we choose to seek a \(\mathbf{B}\) that maximizes only (43), or equivalently minimizes the overall reconstruction error \(\sum _{m=1}^{M} ||\mathbf{I}_m - \mathbf{B}R_m||^2\). This reflects a two-stage strategy for modeling \(\{\mathbf{I}_m\}\): first, find a set of basis functions \(\mathbf{B}\) that reconstructs \(\{\mathbf{I}_m\}\) as accurately as possible; then, fit a statistical model to the reconstruction coefficients.
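
The first stage of this strategy is carried out by a shared matching pursuit. The sketch below illustrates only the basic greedy step, under simplifying assumptions: the candidate wavelets are columns of a generic unit-norm dictionary matrix, and refinements of the full algorithm described in the paper (e.g., image-specific adjustment of the selected wavelets) are omitted.

```python
import numpy as np

def shared_matching_pursuit(images, dictionary, n):
    """Greedily select n dictionary columns shared by all training images.

    images:     M x |D| array of vectorized (aligned) training images.
    dictionary: |D| x K array whose unit-norm columns are candidate wavelets.

    Simplified sketch: plain simultaneous matching pursuit, not the exact
    procedure of the paper.
    """
    residuals = np.array(images, dtype=float)
    selected = []
    for _ in range(n):
        resp = residuals @ dictionary              # M x K responses of all current residuals
        resp[:, selected] = 0.0                    # do not re-select a wavelet
        gain = np.sum(resp ** 2, axis=0)           # total reduction in squared reconstruction error
        k = int(np.argmax(gain))                   # best wavelet shared by all images
        selected.append(k)
        # matching pursuit update of each image's residual
        residuals -= np.outer(resp[:, k], dictionary[:, k])
    return selected
```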

Non-orthogonality Even if \(\mathbf{B}\) is not orthogonal, which is the case in our work, the connection between the sparse FRAME model and shared sparse coding still holds. The responses are \(R = \mathbf{B}^{T} \mathbf{I}\), but the reconstruction coefficients become \(C = (\mathbf{B}^{T}\mathbf{B})^{-1}R\). The projection of \(\mathbf{I}\) onto the subspace spanned by \(\mathbf{B}\) is \(\mathbf{B}C\). We can continue to assume that the implicit \(\bar{\mathbf{B}}= (\bar{B}_i, i = 1, \ldots , \bar{n})\) is orthonormal and orthogonal to the columns of \(\mathbf{B}\), and we can continue to let \(\bar{R} = \bar{\mathbf{B}}^{T}\mathbf{I}\). In this setting, \(R\) and \(\bar{R}\) are still independent under the Gaussian white noise model \(q(\mathbf{I})\) because \(\mathbf{B}\) and \(\bar{\mathbf{B}}\) are orthogonal to each other. Under the sparse FRAME model (15), it is still the case that only the distribution of \(R\) is modified during the change from \(q(\mathbf{I})\) to \(p(\mathbf{I}; \mathbf{B}, \lambda )\), while the distribution of \(\bar{R}\) remains white noise and independent of \(R\). The distribution of \(R\) implies a distribution of the reconstruction coefficients \(C\) because the two are linked by a linear transformation. In fact, the distribution of \(C\) is:

$$\begin{aligned} p_C(C; \lambda ) = \frac{1}{Z(\lambda )} \exp \big (\langle \lambda , |\mathbf{B}^{T}\mathbf{B}C|\rangle \big ) q_C(C), \end{aligned}$$
(45)

where \(q_C(C)\) is the distribution of \(C\) under the reference distribution \(q(\mathbf{I})\), and for a vector \(u\), \(|u|\) means the vector obtained by taking the absolute values of \(u\) component-wise. Now the distributions of \(R\) and \(C\) involve the Jacobian terms such that \(dR d\bar{R} = |\mathrm{det}(\mathbf{B}^{T}\mathbf{B})|^{1/2} d\mathbf{I}= |\mathrm{det}(\mathbf{B}^{T}\mathbf{B})| dC d\bar{R}\). In fact \(p(\mathbf{I}; \mathbf{B}, \lambda ) = p_C(C; \lambda )q_{\bar{R}}(\bar{R}) |\det (\mathbf{B}^{T}\mathbf{B})|^{-1/2}\). By the same logic as in (43) and (44), we still want to find \(\mathbf{B}\) to minimize the overall reconstruction error \(\sum _{m=1}^{M}\Vert \mathbf{I}_m - \mathbf{B}C_m\Vert ^2\).
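
In code, the reconstruction coefficients for a non-orthogonal \(\mathbf{B}\) are obtained from the responses by solving an \(n \times n\) linear system (a minimal sketch):

```python
import numpy as np

def reconstruction_coefficients(I, B):
    """Responses R = B^T I and coefficients C = (B^T B)^{-1} R for a
    (possibly non-orthogonal) |D| x n basis matrix B; B @ C is the
    projection of I onto the subspace spanned by the columns of B."""
    R = B.T @ I
    C = np.linalg.solve(B.T @ B, R)
    return R, C, B @ C
```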

Under the shared sparse coding model, it is tempting to model the coefficients \(C\) of the selected basis functions directly. However, \(C\) is still a multi-dimensional vector, and direct modeling of \(C\) can be difficult. One may assume that the components of \(C\) are statistically independent for simplicity, but this assumption is unlikely to be realistic. So after selecting the basis functions, we choose to model the image intensities by the inhomogeneous FRAME model. Even though this model only matches the marginal distributions of filter responses of the selected basis functions, the model does not assume that the responses are independent.

Cite this article

Xie, J., Hu, W., Zhu, SC. et al. Learning Sparse FRAME Models for Natural Image Patterns. Int J Comput Vis 114, 91–112 (2015). https://doi.org/10.1007/s11263-014-0757-x
