Abstract
It is well known that natural images admit sparse representations by redundant dictionaries of basis functions such as Gabor-like wavelets. However, it is still an open question as to what the next layer of representational units above the layer of wavelets should be. We address this fundamental question by proposing a sparse FRAME (Filters, Random field, And Maximum Entropy) model for representing natural image patterns. Our sparse FRAME model is an inhomogeneous generalization of the original FRAME model. It is a non-stationary Markov random field model that reproduces the observed statistical properties of filter responses at a subset of selected locations, scales and orientations. Each sparse FRAME model is intended to represent an object pattern and can be considered a deformable template. The sparse FRAME model can be written as a shared sparse coding model, which motivates us to propose a two-stage algorithm for learning the model. The first stage selects the subset of wavelets from the dictionary by a shared matching pursuit algorithm. The second stage then estimates the parameters of the model given the selected wavelets. Our experiments show that the sparse FRAME models are capable of representing a wide variety of object patterns in natural images and that the learned models are useful for object classification.
References
Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Adler, A., Elad, M., & Hel-Or, Y. (2013). Probabilistic subspace clustering via sparse representations. IEEE Signal Processing Letters, 20, 63–66.
Aharon, M., Elad, M., & Bruckstein, A. M. (2006). The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54, 4311–4322.
Bengio, Y., Courville, A. C., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on PAMI, 35, 1798–1828.
Bruckstein, A. M., Donoho, D. L., & Elad, M. (2009). From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51, 34–81.
Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129–159.
Chen, J., & Huo, X. (2005). Sparse representations for multiple measurement vectors (MMV) in an overcomplete dictionary. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 257–260.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886–893.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195, 216–222.
Elad, M. (2010). Sparse and redundant representations: From theory to applications in signal and image processing. Berlin: Springer.
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15, 3736–3745.
Elad, M., Milanfar, P., & Rubinstein, R. (2007). Analysis versus synthesis in signal priors. Inverse Problems, 23(3), 947.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In: Proceedings of the Computer Vision and Pattern Recognition Workshops.
Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on PAMI, 32, 1627–1645.
Ferrari, V., Jurie, F., & Schmid, C. (2010). From images to shape models for object detection. International Journal of Computer Vision, 87, 284–303.
Fidler, S., Boben, M., & Leonardis, A. (2008). Similarity-based cross-layered hierarchical representation for object categorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Gelman, A., & Meng, X. L. (1998). Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science, 13, 163–185.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on PAMI, 6, 721–741.
Geman, S., Potter, D. F., & Chi, Z. (2002). Composition systems. Quarterly of Applied Mathematics, 60, 707–736.
Gong, B., Shi, Y., Sha, F. & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: an unsupervised approach. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Technical report, Caltech.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hoffman, J., Rodner, E., Donahue, J., Saenko, K., & Darrell, T. (2013). Efficient learning of domain-invariant image representations. In: Proceedings of the International Conference on Learning Representations.
Hong, Y., Si, Z., Hu, W., Zhu, S. C., & Wu, Y. N. (2013). Unsupervised learning of compositional sparse code for natural image representation. Quarterly of Applied Mathematics, 72, 373–406.
Jhuo, I., Liu, D., Lee, D. T., & Chang, S. (2012). Robust visual domain adaptation with low-rank reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th Annual International Conference on Machine Learning.
Liu, C., Zhu, S. C., & Shum, H. Y. (2001). Learning inhomogeneous Gibbs model of faces by minimax entropy. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 281–287.
Lounici, K., Tsybakov, A. B., Pontil, M., & van de Geer, S. A. (2009). Taking advantage of sparsity in multi-task learning. In: Proceedings of the 22nd Conference on Learning Theory.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.
Mallat, S., & Zhang, Z. (1993). Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41, 3397–3415.
Marszalek, M., & Schmid, C. (2007). Accurate object localization with shape masks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Nam, S., Davies, M. E., Elad, M., & Gribonval, R. (2013). The cosparse analysis model and algorithms. Applied and Computational Harmonic Analysis, 34, 30–56.
Neal, R. (2001). Annealed importance sampling. Statistics and Computing, 11, 125–139.
Neal, R. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo.
Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39, 1–47.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pati, Y. C., Rezaiifar, R., & Krishnaprasad, P. S. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In: Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, pp. 40–44.
Pietra, S. D., Pietra, V. D., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on PAMI, 19, 380–393.
Ranzato, M., & Hinton, G. E. (2010). Modeling pixel means and covariances using factorized third-order Boltzmann machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Roth, S., & Black, M. (2009). Fields of experts. International Journal of Computer Vision, 82, 205–229.
Rubinstein, R., Zibulevsky, M., & Elad, M. (2010). Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58, 1553–1564.
Saenko, K., Kulis, B., Fritz, M. & Darrell, T. (2010). Adapting visual category models to new domains. In: Proceedings of the European Conference on Computer Vision (ECCV).
Shekhar, S., Patel, V. M., Nguyen, H. V., & Chellappa, R. (2013). Generalized domain adaptive dictionaries. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Si, Z., & Zhu, S. C. (2012). Learning hybrid image template (HIT) by information projection. IEEE Transactions on PAMI, 34, 1354–1367.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (pp. 194–281). Cambridge: MIT Press.
Teh, Y. W., Welling, M., Osindero, S., & Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B, 58, 267–288.
Tropp, J., Gilbert, A., & Strauss, M. (2006). Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Processing, 86, 572–588.
Tuytelaars, T., Lampert, C. H., Blaschko, M. B., & Buntine, W. (2009). Unsupervised object discovery: A comparison. International Journal of Computer Vision, 88(2), 284–302.
Vapnik, V. N. (2000). The nature of statistical learning theory. Berlin: Springer.
Welling, M., Hinton, G. E., & Osindero, S. (2003). Learning sparse topographic representations with products of Student-t distributions. In: Proceedings of Advances in Neural Information Processing Systems (NIPS).
Wu, Y. N., Si, Z., Gong, H., & Zhu, S. C. (2010). Learning active basis model for object detection and recognition. International Journal of Computer Vision, 90, 198–235.
Xie, J., Hu, W., Zhu, S. C., & Wu, Y. N. (2014). Learning Inhomogeneous FRAME models for object patterns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Yang, M., Zhang, L., Feng, X., & Zhang, D. (2011). Fisher discrimination dictionary learning for sparse representation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 543–550.
Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, 65, 177–228.
Zeiler, M., Taylor, G., & Fergus, R. (2011). Adaptive deconvolutional networks for mid and high level feature learning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Zhu, L., Lin, C., Huang, H., Chen, Y., & Yuille, A. (2008). Unsupervised structure learning: hierarchical recursive composition, suspicious coincidence and competitive exclusion. In: Proceedings of the European Conference on Computer Vision (ECCV).
Zhu, S. C., & Mumford, D. B. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2, 259–362.
Zhu, S. C., Wu, Y. N., & Mumford, D. B. (1998). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.
Acknowledgments
The work is supported by NSF DMS 1310391, NSF IIS 1423305, ONR MURI N00014-10-1-0933, and DARPA MSEE FA8650-11-1-7149. We thank the three reviewers for their insightful comments and valuable suggestions that have helped us improve the presentation and the content of this paper. We are grateful to one reviewer for sharing the insights on the analysis prior models. Thanks also go to an editor of the special issue for helpful suggestions. We thank Adrian Barbu for discussions.
Communicated by Julien Mairal, Francis Bach, and Michael Elad.
Appendices
Appendix: Simulation by Hamiltonian Monte Carlo
To approximate \(\mathrm{E}_{p(\mathbf{I};\lambda ^{(t)})}[\langle \mathbf{I}, B_{x, s, \alpha }\rangle ]\) in Eq. (9), we need to draw a synthesized sample set \(\{\tilde{\mathbf{I}}_m\}\) from \(p(\mathbf{I};\lambda ^{(t)})\) by HMC (Duane et al. 1987). We can write \(p(\mathbf{I}; \lambda )\) as \(p(\mathbf{I}) \propto \exp (-U(\mathbf{I}))\), where \(\mathbf{I}\in R^{\mathcal{D}}\) and
(assuming \(\sigma ^2 = 1\)). In the physics context, \(\mathbf{I}\) can be regarded as a position vector and \(U(\mathbf{I})\) as the potential energy function. To allow Hamiltonian dynamics to operate, we introduce an auxiliary momentum vector \({\varvec{\phi }}\in R^{\mathcal{D}}\) and the corresponding kinetic energy function \(K({\varvec{\phi }})=\Vert {\varvec{\phi }}\Vert ^2/(2m)\), where \(m\) represents the mass. This defines a fictitious physical system with canonical coordinates \((\mathbf{I},{\varvec{\phi }})\) and total energy \(H(\mathbf{I},{\varvec{\phi }})=U(\mathbf{I})+K({\varvec{\phi }})\). Instead of sampling from \(p(\mathbf{I})\) directly, HMC samples from the joint canonical distribution \(p(\mathbf{I},{\varvec{\phi }}) \propto \exp (-H(\mathbf{I},{\varvec{\phi }}))\), under which \(\mathbf{I}\sim p(\mathbf{I})\) marginally, and \({\varvec{\phi }}\) follows a Gaussian distribution independent of \(\mathbf{I}\). Each iteration of HMC draws a random sample from the marginal Gaussian distribution of \({\varvec{\phi }}\), and then evolves \((\mathbf{I},{\varvec{\phi }})\) according to the Hamiltonian dynamics, which conserves the total energy.
In practice, the leapfrog algorithm is used to discretize the continuous Hamiltonian dynamics as follows, with \(\epsilon \) being the step size:
that is, a half-step update of \({\varvec{\phi }}\) is performed first, and the result is then used to compute \(\mathbf{I}^{(t + \epsilon )}\) and \({\varvec{\phi }}^{(t + \epsilon )}\).
A key step in the leapfrog algorithm is the computation of the derivative of the potential energy function
where the map of responses \(r_{x, s, \alpha } = \langle \mathbf{I}, B_{x, s, \alpha } \rangle \) is computed by bottom-up convolution of the filter corresponding to \((s, \alpha )\) with \(\mathbf{I}\), for each \((s, \alpha )\). Then the derivative is computed by top-down linear superposition of the basis functions: \(-\sum _{x, s, \alpha } \lambda _{x, s, \alpha } \text{ sign }( r_{x, s, \alpha } )B_{x, s, \alpha } + \mathbf{I}\), which can again be computed by convolution. Both the bottom-up and top-down convolutions can be carried out efficiently on GPUs.
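As a sketch of these two passes (not the paper's GPU implementation), the following pure-Python toy replaces images and filter banks with short vectors and explicit inner products, and assumes the potential \(U(\mathbf{I}) = \Vert \mathbf{I}\Vert ^2/2 - \sum _i \lambda _i |\langle \mathbf{I}, B_i\rangle |\) with \(\sigma ^2 = 1\); the finite-difference check is the usual way to validate such a gradient away from the kinks at \(r_i = 0\).

```python
import math

def potential(I, basis, lam):
    """U(I) = ||I||^2/2 - sum_i lam_i |<I, B_i>| (sigma^2 = 1)."""
    r = [sum(Ik * Bk for Ik, Bk in zip(I, B)) for B in basis]
    return 0.5 * sum(Ik * Ik for Ik in I) - sum(l * abs(ri) for l, ri in zip(lam, r))

def potential_grad(I, basis, lam):
    """Bottom-up pass: responses r_i = <I, B_i>.
    Top-down pass:  dU/dI = -sum_i lam_i sign(r_i) B_i + I."""
    r = [sum(Ik * Bk for Ik, Bk in zip(I, B)) for B in basis]
    sign = lambda v: (v > 0) - (v < 0)
    return [I[k] - sum(lam[i] * sign(r[i]) * basis[i][k] for i in range(len(basis)))
            for k in range(len(I))]

# Two toy "basis functions" on a 4-pixel "image" (hypothetical values)
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0 / math.sqrt(2), 1.0 / math.sqrt(2), 0.0]]
lam = [0.5, 1.5]
I = [0.3, -0.7, 1.1, 0.2]
g = potential_grad(I, basis, lam)
```

In the actual model the inner products over all \((x, s, \alpha)\) are exactly the convolutions described above; the toy keeps only the algebra.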
The discretization in the leapfrog algorithm cannot keep \(H(\mathbf{I}, {\varvec{\phi }})\) exactly constant, so a Metropolis acceptance/rejection step is used to correct the discretization error. Starting from the current state \((\mathbf{I},{\varvec{\phi }})\), the new state \((\mathbf{I}^ \star ,{\varvec{\phi }}^ \star )\) after \(L\) leapfrog steps is accepted as the next state of the Markov chain with probability \( \min [1, \exp (-H(\mathbf{I}^ \star ,{\varvec{\phi }}^ \star )+H(\mathbf{I},{\varvec{\phi }})) ]. \) If it is not accepted, the next state is the same as the current state.
In summary, a complete description of the HMC sampler for inhomogeneous FRAME is as follows:

(i) Generate the momentum vector \({\varvec{\phi }}\) from its marginal distribution \(p({\varvec{\phi }}) \propto \exp (-K({\varvec{\phi }}))\), which is the zero-mean Gaussian distribution with covariance matrix \(m I\) (here \(I\) denotes the identity matrix).
(ii) Perform \(L\) leapfrog steps to reach the new state \((\mathbf{I}^{\star },{\varvec{\phi }}^{\star })\).
(iii) Perform acceptance/rejection of the proposed state \((\mathbf{I}^{\star },{\varvec{\phi }}^{\star })\).
\(L\), \(\epsilon \), and \(m\) are parameters of the algorithm, which need to be tuned to obtain good performance.
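For concreteness, the full loop of momentum refreshment, \(L\) leapfrog steps, and Metropolis correction can be sketched in a one-dimensional toy setting. The quadratic potential \(U(x) = x^2/2\) (a standard normal target) stands in for the image model; all names and tuning values here are illustrative, not the paper's implementation.

```python
import math, random

def hmc_sample(grad_U, U, x0, eps=0.1, L=20, n=5000, m=1.0, seed=0):
    """Minimal 1-D HMC sampler: leapfrog integration + Metropolis correction."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n):
        phi = rng.gauss(0.0, math.sqrt(m))       # (i) refresh momentum ~ N(0, m)
        x_new, p = x, phi
        p -= 0.5 * eps * grad_U(x_new)           # half-step for momentum
        for step in range(L):                    # (ii) L leapfrog steps
            x_new += eps * p / m                 # full step for position
            if step < L - 1:
                p -= eps * grad_U(x_new)         # full step for momentum
        p -= 0.5 * eps * grad_U(x_new)           # final half-step for momentum
        H_cur = U(x) + phi * phi / (2.0 * m)     # (iii) accept with probability
        H_new = U(x_new) + p * p / (2.0 * m)     #       min(1, exp(-H_new + H_cur))
        if rng.random() < min(1.0, math.exp(H_cur - H_new)):
            x = x_new
        samples.append(x)
    return samples

# Toy target: standard normal, U(x) = x^2/2, so grad_U(x) = x
xs = hmc_sample(grad_U=lambda x: x, U=lambda x: 0.5 * x * x, x0=0.0)
mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
```

With these (hypothetical) settings the sample mean and variance come out close to 0 and 1, illustrating how \(\epsilon\), \(L\), and \(m\) trade off acceptance rate against mixing speed.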
Maximum Entropy Justification
The inhomogeneous FRAME model can be justified by the maximum entropy principle. Suppose the true distribution that generates the observed images \(\{\mathbf{I}_m\}\) is \(f(\mathbf{I})\). Let \(\lambda ^{\star }\) solve the population version of the maximum likelihood equation:
Let \(\varOmega \) be the set of all the distributions \(p(\mathbf{I})\) such that
Then \(f \in \varOmega \). Let \(\Lambda \) be the set of all the distributions \(\{p_{\lambda }, \forall \lambda \}\), where \(p_{\lambda }(\mathbf{I}) = p(\mathbf{I}; \lambda )\). Then \(q \in \Lambda \), since \(q(\mathbf{I}) = p(\mathbf{I}; \lambda = 0)\). Thus \(p_{\lambda ^\star }\) is the intersection of \(\Lambda \) and \(\varOmega \). In Fig. 17, \(\Lambda \) and \(\varOmega \) are illustrated by blue and green curves respectively, where each point on the curves is a probability distribution. The two curves \(\Lambda \) and \(\varOmega \) are “orthogonal” in the sense that for any \(p_{\lambda } \in \Lambda \) and any \(p \in \varOmega \), the Pythagorean property
\(\mathrm{KL}(p\Vert p_{\lambda }) = \mathrm{KL}(p\Vert p_{\lambda ^{\star }}) + \mathrm{KL}(p_{\lambda ^{\star }}\Vert p_{\lambda })\)
holds (Pietra et al. 1997), where \(\mathrm{KL}(p\Vert q)\) denotes the Kullback-Leibler divergence from \(p\) to \(q\). This Pythagorean property leads to the following dual properties of \(p_{\lambda ^{\star }}\):

(1) Maximum likelihood: among all \(p_{\lambda } \in \Lambda \), \(p_{\lambda ^{\star }}\) achieves the minimum of \(\mathrm{KL}(f\Vert p_{\lambda })\).
(2) Maximum entropy or minimum divergence: among all \(p \in {\varOmega }\), \(p_{\lambda ^{\star }}\) achieves the minimum of \(\mathrm{KL}(p\Vert q)\). Thus \(p_{\lambda ^{\star }}\) can be considered the minimal modification of the reference distribution \(q\) that matches the statistical properties of the true distribution \(f\).
The above justification is also true for the sparse FRAME model.
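The Pythagorean property and the two dual properties can be checked numerically on a small discrete example. The sketch below tilts a uniform reference \(q\) on four states to match a target feature expectation (a hypothetical one-feature stand-in for the filter statistics), solves for \(\lambda ^{\star }\) by bisection, and verifies \(\mathrm{KL}(p\Vert q) = \mathrm{KL}(p\Vert p_{\lambda ^{\star }}) + \mathrm{KL}(p_{\lambda ^{\star }}\Vert q)\) for another \(p \in \varOmega \).

```python
import math

def tilt(q, lam):
    """Exponential tilting p_lam(x) proportional to q(x) exp(lam * x) on states 0..len(q)-1."""
    w = [qx * math.exp(lam * x) for x, qx in enumerate(q)]
    Z = sum(w)
    return [wx / Z for wx in w]

def mean(p):
    return sum(x * px for x, px in enumerate(p))

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

q = [0.25] * 4               # reference distribution (uniform)
target = 2.0                 # desired feature expectation E[h], with h(x) = x
lo, hi = -10.0, 10.0         # solve mean(tilt(q, lam)) = target by bisection
for _ in range(200):
    lam = 0.5 * (lo + hi)
    lo, hi = (lam, hi) if mean(tilt(q, lam)) < target else (lo, lam)
p_star = tilt(q, 0.5 * (lo + hi))

p = [0.1, 0.1, 0.5, 0.3]     # another distribution in Omega: mean(p) == 2.0
lhs = kl(p, q)               # Pythagorean: KL(p||q) = KL(p||p*) + KL(p*||q)
rhs = kl(p, p_star) + kl(p_star, q)
```

The identity holds exactly because \(\log (p_{\lambda ^{\star }}/q)\) is linear in the feature, so its expectation is the same under any \(p\) with the matched feature expectation.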
For sparsification, in principle, we can select \(B_{x_i, s_i, \alpha _i}\) sequentially using a procedure like projection pursuit (Friedman 1987) or filter pursuit (Zhu et al. 1998). Suppose we have selected \(k\) basis functions \((B_{x_i, s_i, \alpha _i}, i = 1, \ldots , k)\), and let \(p_k\) be the fitted model with the corresponding \(\lambda = (\lambda _i, i = 1, \ldots , k)\) estimated by MLE. Suppose we are to select the next basis function \(B_{x_{k+1}, s_{k+1}, \alpha _{k+1}}\), and let \(p_{k+1}\) be the fitted model. Then we want to minimize \(\mathrm{KL}(f\Vert p_{k+1}) = \mathrm{KL}(f\Vert p_{k}) - \mathrm{KL}(p_{k+1}\Vert p_k)\), that is, we want to maximize \(\mathrm{KL}(p_{k+1}\Vert p_k)\), which serves as the pursuit index. The problem with such a procedure is that each step requires fitting the model, which involves MCMC computation, and the pursuit index itself is difficult to compute. So we pursue a different approach by exploring the connection between the sparse FRAME model and shared sparse coding.
Sparse FRAME and Shared Sparse Coding
From sparse FRAME to shared sparse coding. Let us assume that the reference distribution \(q(\mathbf{I})\) in the sparse FRAME model (15) is a Gaussian white noise model, so that the pixel intensities follow \(\mathrm{N}(0, \sigma ^2)\) independently. For the sparse FRAME model, it is natural to assume that the number of selected basis functions \(n\) is much less than the number of pixels of \(\mathbf{I}\), i.e., \(n \ll \mathcal{D}\), where \(\mathcal{D}\) is the dimension of the image domain. For notational convenience, we can make \(\mathbf{I}\) and \(B_i = B_{x_i, s_i, \alpha _i}\), \(i = 1, \ldots , n\), into \(\mathcal{D}\)-dimensional vectors, and let \(\mathbf{B}= (B_1, \ldots , B_n)\) be the resulting \(\mathcal{D} \times n\) matrix.
The connection between sparse FRAME and shared sparse coding is most evident if we temporarily assume that the selected basis functions \((B_i, i = 1, \ldots , n)\) are orthogonal (with unit \(\ell _2\) norm, as assumed before). The extension to non-orthogonal \(\mathbf{B}\) is straightforward but requires tedious notation (such as \((\mathbf{B}^{T}\mathbf{B})^{-1}\)). For \(\mathbf{B}\), we can construct \(\bar{n} = \mathcal{D} - n\) basis vectors of unit norm, \(\bar{B}_1, \ldots , \bar{B}_{\bar{n}}\), that are orthogonal to each other and to \((B_i, i = 1, \ldots , n)\). Thus each image \(\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \sum _{i=1}^{\bar{n}} \bar{r}_i \bar{B}_i\), where \(r_i = \langle \mathbf{I}, B_i \rangle \) and \(\bar{r}_i = \langle \mathbf{I}, \bar{B}_i\rangle \). So we have the linear additive model \(\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \epsilon \), with \(\epsilon = \sum _{i=1}^{\bar{n}} \bar{r}_i \bar{B}_i\) being the least squares residual image.
Under the Gaussian white noise \(q(\mathbf{I})\), \(r_i\) and \(\bar{r}_i\) are all independent \(\mathrm{N}(0, \sigma ^2)\) random variables because of the orthogonality of \((\mathbf{B}, \bar{\mathbf{B}})\). Let \(R\) be the column vector whose elements are \(r_i\), and \(\bar{R}\) be the column vector whose elements are \(\bar{r}_i\). Then under the sparse FRAME model (15), only the distribution of \(R\) is modified during the change from \(q(\mathbf{I})\) to \(p(\mathbf{I}; \mathbf{B}, \lambda )\), which changes the distribution of \(R\) from Gaussian white noise \(q(R)\) to
while the distribution of the residual coordinates \(\bar{R}\) remains Gaussian white noise, and \(R\) and \(\bar{R}\) remain independent. That is, \(p(R, \bar{R}; \lambda ) = p(R; \lambda ) q(\bar{R}) \).
Thus the sparse FRAME model implies a linear additive model \(\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \epsilon \), where \(R \sim p(R; \lambda )\) and \(\epsilon \) is a Gaussian white noise in the \(\bar{n}\)dimensional residual space, and \(\epsilon \) is independent of \(R\). If we observe independent training images \(\{\mathbf{I}_m, m = 1, \ldots , M\}\) from the model, then \(\mathbf{I}_m = \sum _{i=1}^{n} r_{m, i} B_i + \epsilon _m\), i.e., \(\{\mathbf{I}_m\}\) share a common set of basis functions \(\mathbf{B}= (B_i, i = 1, \ldots , n)\) that provide sparse coding for multiple images simultaneously.
From shared sparse coding to sparse FRAME. Conversely, suppose we are given a shared sparse coding model of the form \(\mathbf{I}=\sum _{i=1}^{n} c_i B_i + \epsilon = \mathbf{B}C + \epsilon \), where \(C\) is a column vector whose components are \(c_i\). Assume \(C \sim p(C)\) and \(\epsilon \sim \mathrm{N}(0, I \sigma ^2)\), where \(I\) is the \(\mathcal{D}\)-dimensional identity matrix, and \(\epsilon \) and \(C\) are independent. Let \(\delta = \mathbf{B}^T \epsilon \), each component of which, \(\delta _i = \langle \epsilon , B_i\rangle \), follows \(\mathrm{N}(0, \sigma ^2)\) independently. Then we can write \(\mathbf{I}= \mathbf{B}R + \bar{\mathbf{B}}\bar{R}\), where \(R = C + \delta \), and \(\bar{\epsilon } = \bar{\mathbf{B}}\bar{R}\) is the projection of \(\epsilon \) onto the space of \(\bar{\mathbf{B}}\). Let \(\tilde{p}(R)\) be the density of \(R = C+ \delta \), which is obtained by convolving \(p(C)\) with the Gaussian white noise density. Then \(p(\mathbf{I}) = \tilde{p}(R) q(\bar{R}) = q(\mathbf{I}) \tilde{p}(R) /q(R)\), since \(q(\mathbf{I}) = q(R)q(\bar{R})\) under the Gaussian white noise model (\(d\mathbf{I}= dR \, d\bar{R}\) under orthogonality, so there is no Jacobian term). If we choose to model \(\tilde{p}(R)/q(R) = \exp \left( \sum _{i=1}^{n} \lambda _i |r_i|\right) /Z(\lambda )\), we arrive at the sparse FRAME model.
Selection of basis functions. For orthogonal \(\mathbf{B}\), as shown above, the probability density is \(p(\mathbf{I}; \mathbf{B}, \lambda ) = q(\bar{R}) p(R; \lambda ) = q(\bar{R}) q(R) \exp \left( \sum _{i=1}^{n} \lambda _i |r_i|\right) /Z(\lambda )\). Given a set of training images \(\{\mathbf{I}_m, m = 1, \ldots , M\}\) and a candidate set of basis functions \(\mathbf{B}= (B_i, i= 1, \ldots , n)\), we can estimate \(\lambda = (\lambda _i, i = 1, \ldots , n)\) by MLE, giving us \(\lambda ^{\star }\), and the resulting log-likelihood is
Suppose we are to choose \(\mathbf{B}\) from a collection of candidates. Ideally we should maximize the sum of (43) and (44). We may interpret (43) as the negative coding length of the residual image \(\epsilon \) under the Gaussian white noise model, and (44) as the negative coding length of the coefficients \(R_m\) under the fitted model \(p(R; \lambda ^{\star })\). If \(\sigma ^2\) is small, (43) is the dominant term, while the coding length of \(R_m\) may not differ much across different \(\mathbf{B}\). So we choose to seek a \(\mathbf{B}\) that maximizes only (43), or equivalently minimizes the overall reconstruction error \(\sum _{m=1}^{M} \Vert \mathbf{I}_m - \mathbf{B}R_m\Vert ^2\). This reflects a two-stage strategy in modeling \(\{\mathbf{I}_m\}\): first, find a set of basis functions \(\mathbf{B}\) that reconstructs \(\{\mathbf{I}_m\}\) as accurately as possible; then, fit a statistical model to the reconstruction coefficients.
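The first stage can be sketched as a shared matching pursuit: greedily pick the atom that absorbs the most energy summed over all training images, then deduct its contribution from every residual. This toy version (illustrative names and data, not the paper's implementation) assumes an orthonormal dictionary, here the standard basis of \(R^4\), so the deduction is a simple projection.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def shared_matching_pursuit(images, dictionary, n_select):
    """Greedily select basis vectors shared by all images (orthonormal-dictionary sketch)."""
    residuals = [img[:] for img in images]
    selected = []
    for _ in range(n_select):
        # score each remaining atom by total squared response over all residuals
        scores = {j: sum(dot(res, B) ** 2 for res in residuals)
                  for j, B in enumerate(dictionary) if j not in selected}
        best = max(scores, key=scores.get)
        selected.append(best)
        # deduct the shared atom's contribution from every image's residual
        B = dictionary[best]
        for res in residuals:
            r = dot(res, B)
            for k in range(len(res)):
                res[k] -= r * B[k]
    return selected

# Standard basis of R^4 as the dictionary; toy images live mostly on axes 0 and 2
dictionary = [[1.0 if k == j else 0.0 for k in range(4)] for j in range(4)]
images = [[3.0, 0.0, 1.0, 0.0], [2.0, 0.0, 2.0, 0.0]]
selected = shared_matching_pursuit(images, dictionary, 2)
```

Because the selection score is summed over all images, the chosen atoms are shared across the training set, which is the point of the first stage.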
Non-orthogonality. Even if \(\mathbf{B}\) is not orthogonal, which is the case in our work, the connection between the sparse FRAME model and shared sparse coding still holds. The responses are still \(R = \mathbf{B}^{T} \mathbf{I}\), but the reconstruction coefficients become \(C = (\mathbf{B}^{T}\mathbf{B})^{-1}R\). The projection of \(\mathbf{I}\) onto the subspace spanned by \(\mathbf{B}\) is \(\mathbf{B}C\). We can continue to assume the implicit \(\bar{\mathbf{B}}= (\bar{B}_i, i = 1, \ldots , \bar{n})\) to be orthonormal and orthogonal to the columns of \(\mathbf{B}\), and continue to let \(\bar{R} = \bar{\mathbf{B}}^{T}\mathbf{I}\). In this setting, \(R\) and \(\bar{R}\) are still independent under the Gaussian white noise model \(q(\mathbf{I})\), because \(\mathbf{B}\) and \(\bar{\mathbf{B}}\) are orthogonal to each other. Under the sparse FRAME model (15), it is still the case that only the distribution of \(R\) is modified during the change from \(q(\mathbf{I})\) to \(p(\mathbf{I}; \mathbf{B}, \lambda )\), while the distribution of \(\bar{R}\) remains white noise and is independent of \(R\). The distribution of \(R\) implies a distribution of the reconstruction coefficients \(C\), because they are linked by a linear transformation. In fact, the distribution of \(C\) is:
where \(q_C(C)\) is the distribution of \(C\) under the reference distribution \(q(\mathbf{I})\), and for a vector \(u\), \(|u|\) denotes the vector obtained by taking the absolute values of \(u\) component-wise. Now the distributions of \(R\) and \(C\) involve Jacobian terms such that \(dR \, d\bar{R} = \mathrm{det}(\mathbf{B}^{T}\mathbf{B})^{1/2} \, d\mathbf{I}= \mathrm{det}(\mathbf{B}^{T}\mathbf{B}) \, dC \, d\bar{R}\). In fact, \(p(\mathbf{I}; \mathbf{B}, \lambda ) = p_C(C; \lambda )q_{\bar{R}}(\bar{R}) \det (\mathbf{B}^{T}\mathbf{B})^{-1/2}\). By the same logic as in (43) and (44), we still want to find \(\mathbf{B}\) to minimize the overall reconstruction error \(\sum _{m=1}^{M}\Vert \mathbf{I}_m - \mathbf{B}C_m\Vert ^2\).
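A quick numeric check of the non-orthogonal bookkeeping: with two unit-norm but correlated atoms, the coefficients are \(C = (\mathbf{B}^{T}\mathbf{B})^{-1}R\), and the residual of the projection \(\mathbf{B}C\) is orthogonal to both atoms. The \(2 \times 2\) Gram inverse is written out explicitly; the vectors are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_two_atoms(I, B1, B2):
    """Coefficients C = (B^T B)^{-1} R and projection B C for two atoms."""
    R = [dot(I, B1), dot(I, B2)]                 # responses R = B^T I
    g11, g12, g22 = dot(B1, B1), dot(B1, B2), dot(B2, B2)
    det = g11 * g22 - g12 * g12                  # Gram matrix and explicit 2x2 inverse
    C = [(g22 * R[0] - g12 * R[1]) / det,
         (-g12 * R[0] + g11 * R[1]) / det]
    proj = [C[0] * a + C[1] * b for a, b in zip(B1, B2)]
    return C, proj

s = 1.0 / math.sqrt(2.0)
B1 = [1.0, 0.0, 0.0]
B2 = [s, s, 0.0]                                 # unit norm, not orthogonal to B1
I = [1.0, 2.0, 3.0]
C, proj = project_two_atoms(I, B1, B2)
resid = [a - b for a, b in zip(I, proj)]
```

The residual lies entirely in the third coordinate (the span of the implicit \(\bar{\mathbf{B}}\)), illustrating the decomposition \(\mathbf{I} = \mathbf{B}C + \bar{\mathbf{B}}\bar{R}\).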
Under the shared sparse coding model, it is tempting to model the coefficients \(C\) of the selected basis functions directly. However, \(C\) is still a multidimensional vector, and direct modeling of \(C\) can be difficult. One may assume that the components of \(C\) are statistically independent for simplicity, but this assumption is unlikely to be realistic. So after selecting the basis functions, we choose to model the image intensities by the inhomogeneous FRAME model. Even though this model only matches the marginal distributions of filter responses of the selected basis functions, the model does not assume that the responses are independent.
Xie, J., Hu, W., Zhu, S. C., et al. Learning Sparse FRAME Models for Natural Image Patterns. Int J Comput Vis 114, 91–112 (2015). https://doi.org/10.1007/s11263-014-0757-x