The Poisson transform for unnormalised statistical models

Barthelmé, Simon; Chopin, Nicolas

doi:10.1007/s11222-015-9559-4

Published: 11 June 2015

Volume 25, pages 767–780, (2015)

Statistics and Computing

Simon Barthelmé¹ &
Nicolas Chopin²


Abstract

Contrary to standard statistical models, unnormalised statistical models only specify the likelihood function up to a constant. While such models are natural and popular, the lack of normalisation makes inference much more difficult. Extending classical results on the multinomial-Poisson transform (Baker, J Royal Stat Soc 43(4):495–504, 1994), we show that inferring the parameters of an unnormalised model on a space $\Omega $ can be mapped onto an equivalent problem of estimating the intensity of a Poisson point process on $\Omega $. The unnormalised statistical model now specifies an intensity function that does not need to be normalised. Effectively, the normalisation constant may now be inferred as just another parameter, at no loss of information. The result can be extended to cover non-IID models, which include, for example, unnormalised models for sequences of graphs (dynamical graphs) or for sequences of binary vectors. As a consequence, we prove that unnormalised parametric inference in non-IID models can be turned into a semi-parametric estimation problem. Moreover, we show that the noise-contrastive estimation method of Gutmann and Hyvärinen (J Mach Learn Res 13(1):307–361, 2012) can be understood as an approximation of the Poisson transform, and extended to non-IID settings. We use our results to fit spatial Markov chain models of eye movements, where the Poisson transform allows us to turn a highly non-standard model into vanilla semi-parametric logistic regression.

References

Baddeley, A., Berman, M., Fisher, N.I., Hardegen, A., Milne, R.K., Schuhmacher, D., Shah, R., Turner, R.: Spatial logistic regression and change-of-support in Poisson point processes. Electron. J. Stat. 4, 1151–1201 (2010)
Baddeley, A., Coeurjolly, J.-F., Rubak, E., Waagepetersen, R.: Logistic regression for spatial Gibbs point processes. Biometrika 101(2), 377–392 (2014)
Baker, S.G.: The multinomial-Poisson transformation. J. Royal Stat. Soc. 43(4), 495–504 (1994)
Barthelmé, S., Trukenbrod, H., Engbert, R., Wichmann, F.: Modeling fixation locations using spatial point processes. J. Vis. 13(12), 1 (2013)
Bengio, Y., Delalleau, O.: Justifying and generalizing contrastive divergence. Neural Comput. 21(6), 1601–1621 (2009)
Caimo, A., Friel, N.: Bayesian inference for exponential random graph models. Soc. Netw. 33(1), 41–55 (2011)
Engbert, R., Trukenbrod, H.A., Barthelmé, S., Wichmann, F.A.: Spatial statistics and attentional dynamics in scene viewing. J. Vis. 15(1), 14 (2014)
Foulsham, T., Kingstone, A., Underwood, G.: Turning the world around: patterns in saccade direction vary with picture orientation. Vis. Res. 48(17), 1777–1790 (2008)
Geyer, C.J.: Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report 568, School of Statistics, University of Minnesota (1994)
Girolami, M., Lyne, A.-M., Strathmann, H., Simpson, D., Atchade, Y.: Playing Russian roulette with intractable likelihoods. arXiv:1306.4032 (2013)
Gu, M.G., Zhu, H.-T.: Maximum likelihood estimation for spatial models by Markov chain Monte Carlo stochastic approximation. J. Royal Stat. Soc. 63(2), 339–355 (2001)
Gutmann, M., Ichiro Hirayama, J.: Bregman divergence as general framework to estimate unnormalized statistical models. In: Cozman, F.G., Pfeffer, A. (eds.) Uncertainty in Artificial Intelligence, pp. 283–290. AUAI Press, Barcelona (2011)
Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13(1), 307–361 (2012)
Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer, New York (2003). corrected edition
Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)
Kienzle, W., Franz, M.O., Schölkopf, B., Wichmann, F.A.: Center-surround patterns emerge as optimal predictors for human saccade targets. J. Vis. 9(5), 7 (2009)
Kingman, J.F.C.: Poisson Processes (Oxford Studies in Probability). Oxford University Press, London (1993)
Li, P., König, A.C.: Theory and applications of b-bit minwise hashing. Commun. ACM 54(8), 101–109 (2011)
Minka, T.: Divergence Measures and Message Passing. Technical report, Microsoft Research Technical Report (2005)
Mnih, A., Kavukcuoglu, K.: Learning word embeddings efficiently with noise-contrastive estimation. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 26, pp. 2265–2273. Curran Associates Inc, Red Hook (2013)
Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1751–1758 (2012a)
Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, 26 June–1 July, 2012 (2012b)
Møller, J., Pettitt, A.N., Reeves, R., Berthelsen, K.K.: An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458 (2006)
Murray, I., Ghahramani, Z., MacKay, D.: MCMC for doubly-intractable distributions. In: Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI) (2006)
Pihlaja, M., Gutmann, M., Hyvärinen, A.: A family of computationally efficient and simple estimators for unnormalized statistical models. In: UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, pp. 442–449, 8–11 July 2010 (2010)
Salakhutdinov, R., Hinton, G.E.: Deep Boltzmann machines. In: International Conference on Artificial Intelligence and Statistics, pp. 448–455 (2009)
Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 1st edn. The MIT Press, Cambridge (2001)
Tatler, B., Vincent, B.: The prominence of behavioural biases in eye guidance. Vis. Cogn. 17(6), 1029–1054 (2009)
Van der Vaart, A.W.: Asymptotic statistics. Cambridge University Press, Cambridge (2000)
Walker, S.G.: Posterior sampling when the normalizing constant is unknown. Commun. Stat. 40(5), 784–792 (2011)
Wang, C., Komodakis, N., Paragios, N.: Markov random field modeling, inference & learning in computer vision & image understanding: a survey. Comput. Vis. Image Underst. 117(11), 1610–1627 (2013)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688 (2011)
Wood, S.: Generalized Additive Models: An Introduction with R (Chapman & Hall/CRC Texts in Statistical Science), 1st edn. Chapman and Hall/CRC, Boca Raton (2006)
Wood, S.N.: Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J. Royal Stat. Soc. 73(1), 3–36 (2011)

Author information

Authors and Affiliations

CNRS, GIPSA-Lab, F-38000, Grenoble, France
Simon Barthelmé
CREST-ENSAE and HEC, Paris, France
Nicolas Chopin


Corresponding author

Correspondence to Nicolas Chopin.

Appendices

Derivatives of Poisson-transformed likelihoods

The first and second derivatives of $\mathcal {L}\left( \varvec{\theta }\right) $ and $\mathcal {M}\left( \varvec{\theta },\nu \right) $ are needed in the proofs and we collect them here.

Derivatives of $\mathcal {L}\left( \varvec{\theta }\right) $:

$$\begin{aligned}&\mathcal {L}(\varvec{\theta }) = \sum _{i=1}^{n}f_{\varvec{\theta }}(\mathbf {y}_{i})-n\log \left( \int \text{ exp }\left\{ f_{\varvec{\theta }}\left( s\right) \right\} \text{ d }s\right) \\&\qquad :=\sum _{i=1}^{n}f_{\varvec{\theta }}(\mathbf {y}_{i})-n\phi \left( \varvec{\theta }\right) \\&\frac{\partial }{\partial \varvec{\theta }}\phi \left( \varvec{\theta }\right) = \!\int \!\frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}(s)\text{ exp }\left\{ f_{\varvec{\theta }}\!\left( s\right) \!-\!\phi \left( \varvec{\theta }\right) \right\} \text{ d }s\!=\!E_{\varvec{\theta }}\!\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \\&\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\phi \left( \varvec{\theta }\right) = E_{\varvec{\theta }}\left( \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\right) \!+\!E_{\varvec{\theta }}\!\left( \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \!\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) \\&\qquad \qquad \!\!\!\quad \,\,-E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\\&\frac{\partial }{\partial \varvec{\theta }}\mathcal {L}= \sum _{i=1}^{n}\frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}(\mathbf {y}_{i})-n\frac{d}{d\varvec{\theta }}\phi \left( \varvec{\theta }\right) \\&\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {L}\left( \varvec{\theta }\right) = \sum \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}(\mathbf {y}_{i})-n\frac{d^{2}}{d\varvec{\theta }^{2}}\phi \left( \varvec{\theta }\right) \end{aligned}$$

where we have used $E_{\varvec{\theta }}$ as shorthand for the expectation with respect to density $\text{ exp }\left\{ f_{\varvec{\theta }}\left( s\right) -\phi \left( \varvec{\theta }\right) \right\} $.

Derivatives of $\mathcal {M}\left( \varvec{\theta },\nu \right) $:

$$\begin{aligned}&\mathcal {M}\left( \varvec{\theta },\nu \right) = \sum _{i=1}^{n}\left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} -n\int \exp \left\{ f_{\varvec{\theta }}\left( s\right) +\nu \right\} \mathrm{d}s\\&\frac{\partial }{\partial \varvec{\theta }}\mathcal {M}\left( \varvec{\theta },\nu \right) = \sum _{i=1}^{n}\frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}(\mathbf {y}_{i})-nE_{\varvec{\theta },\nu }\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \\&\frac{\partial }{\partial \nu }\mathcal {M}\left( \varvec{\theta },\nu \right) = n-n\int \exp \left\{ f_{\varvec{\theta }}\left( s\right) +\nu \right\} \mathrm{d}s\\&\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right) = \sum _{i=1}^{n}\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\left( \mathbf {y}_{i}\right) -n\left( E_{\varvec{\theta },\nu }\left( \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\right) \right. \\&\qquad \qquad \qquad \,\left. +\,E_{\varvec{\theta },\nu }\left( \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) \right) \\&\frac{\partial ^{2}}{\partial \nu ^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right) = -n\int \exp \left\{ f_{\varvec{\theta }}\left( s\right) +\nu \right\} \mathrm{d}s\\&\frac{\partial ^{2}}{\partial \varvec{\theta }\partial \nu }\mathcal {M}\left( \varvec{\theta },\nu \right) = -nE_{\varvec{\theta },\nu }\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \end{aligned}$$

where we have used $E_{\varvec{\theta },\nu }$ as shorthand for the linear operator $E_{\varvec{\theta },\nu }(\varphi )=\int \varphi (s)\text{ exp }\left\{ f_{\varvec{\theta }}\left( s\right) +\nu \right\} \,\text{ d }s$ (which is not an expectation in general).
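
As a quick numerical check of these formulas, the sketch below uses a toy model that is not taken from the article: $\Omega =[0,1]$ and $f_{\theta }(s)=\theta s$, for which $\int \exp \{f_{\theta }(s)\}\,\mathrm{d}s=(e^{\theta }-1)/\theta $ in closed form. The data values, the value of `theta` and all function names are illustrative assumptions. The script checks that $\partial \mathcal {M}/\partial \nu $ vanishes at $\nu ^{\star }(\theta )=-\phi (\theta )$ and that $\mathcal {M}(\theta ,\nu ^{\star }(\theta ))=\mathcal {L}(\theta )-n$, which follows from substituting $\nu ^{\star }$ into the definition of $\mathcal {M}$.

```python
import math

def log_Z(theta):
    """phi(theta): log of the integral of exp(theta * s) over [0, 1]."""
    return math.log(math.expm1(theta) / theta)

def log_lik(theta, ys):
    """Normalised log-likelihood L(theta) for f_theta(s) = theta * s."""
    return theta * sum(ys) - len(ys) * log_Z(theta)

def poisson_lik(theta, nu, ys):
    """Poisson-transformed log-likelihood M(theta, nu)."""
    n = len(ys)
    return theta * sum(ys) + n * nu - n * math.exp(nu + log_Z(theta))

ys = [0.10, 0.40, 0.45, 0.80, 0.95]   # made-up observations in [0, 1]
theta = 1.3                            # arbitrary parameter value
n = len(ys)

nu_star = -log_Z(theta)                               # maximiser of M in nu
grad_nu = n - n * math.exp(nu_star + log_Z(theta))    # dM/dnu at nu_star
gap = poisson_lik(theta, nu_star, ys) - (log_lik(theta, ys) - n)

print(grad_nu, gap)   # both should vanish up to rounding error
```

Profiling out $\nu $ therefore recovers the normalised log-likelihood up to the additive constant $-n$, so both criteria share the same maximiser in $\theta $ in this toy setting.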

Further properties of the Poisson transform

1.1 The Poisson transform preserves confidence intervals

The usual method for obtaining confidence intervals for $\varvec{\theta }$ is to invert the Hessian matrix of $\mathcal {L}\left( \varvec{\theta }\right) $ at the mode, $\varvec{\theta }^{\star }$:

$$\begin{aligned} \mathbf {C}_{\mathcal {L}}=\left( -\frac{d^{2}}{d^{2}\varvec{\theta }}\mathcal {L}\left| _{\varvec{\theta }=\varvec{\theta }^{\star }}\right. \right) ^{-1} \end{aligned}$$

We can show that the same confidence intervals can be obtained from $\mathcal {M}\left( \varvec{\theta },\nu \right) $ at the joint mode, $\varvec{\theta }^{\star },\nu ^{\star }$.

At the joint maximum, $\nu ^{\star }$ normalises the intensity function, and the Hessian of $\mathcal {M}$ equals

$$\begin{aligned} H= & {} \left[ \begin{array}{cc} \mathbf {H}_{aa} &{} \mathbf {H}_{ba}\\ \mathbf {H}_{ab} &{} \mathbf {H}_{bb} \end{array}\right] =\left[ \begin{array}{cc} \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right) &{} \frac{\partial ^{2}}{\partial \varvec{\theta }\partial \nu }\mathcal {M}\left( \varvec{\theta },\nu \right) \\ \frac{\partial ^{2}}{\partial \nu \partial \varvec{\theta }}\mathcal {M}\left( \varvec{\theta },\nu \right) &{} \frac{\partial ^{2}}{\partial \nu ^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right) \end{array}\right] \\= & {} \left[ \begin{array}{cc} \sum _{i=1}^{n}\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\left( \mathbf {y}_{i}\right) -nE_{\varvec{\theta }}\left( \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\right) - nE_{\varvec{\theta }}\left( \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) &{} -nE_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \\ -nE_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t} &{} -n \end{array}\right] \end{aligned}$$

where again E denotes the expectation with respect to density $\text{ exp }\left\{ f_{\varvec{\theta }}\left( s\right) -\phi \left( \varvec{\theta }\right) \right\} $.

Inverting $-H$ also yields confidence intervals. By the inversion rule for block matrices, the approximate covariance for $\varvec{\theta }$ using $\mathcal {M}\left( \varvec{\theta },\nu \right) $ equals

$$\begin{aligned} \mathbf {C}_{\mathcal {M}}^{-1}= & {} -\left( \mathbf {H}_{aa}-\mathbf {H}_{ba}\mathbf {H}_{bb}^{-1}\mathbf {H}_{ab}\right) \\= & {} -\left( \mathbf {H}_{aa}+\frac{1}{n}n^{2}E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) \\= & {} -\Bigg [\sum \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\left( \mathbf {y}_{i}\right) -nE_{\varvec{\theta }}\left( \frac{\partial ^{2}}{\partial ^{2}\varvec{\theta }}f\right) \\&-\,nE_{\varvec{\theta }}\left( \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) \\&+\,nE_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) E\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\Bigg ]\\= & {} \mathbf {C}_{\mathcal {L}}^{-1} \end{aligned}$$
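
The block-inversion argument can be checked numerically on a toy case. The sketch below assumes (for illustration only, not from the article) a scalar parameter with $\Omega =[0,1]$ and $f_{\theta }(s)=\theta s$, so that $\partial ^{2}f_{\theta }/\partial \theta ^{2}=0$ and the Hessian blocks reduce to moments of $s$. The names `moments`, `prec_M` and `prec_L`, the grid size and the finite-difference step are all choices made here. It compares the Schur complement of $-H$ with a finite-difference estimate of $-\partial ^{2}\mathcal {L}/\partial \theta ^{2}=n\,\partial ^{2}\phi /\partial \theta ^{2}$.

```python
import math

def phi(theta):
    """Log normalising constant for f_theta(s) = theta * s on [0, 1]."""
    return math.log(math.expm1(theta) / theta)

def moments(theta, grid=20000):
    """Midpoint-rule estimates of E[s] and E[s^2] under exp(theta*s - phi)."""
    h = 1.0 / grid
    z = m1 = m2 = 0.0
    for k in range(grid):
        s = (k + 0.5) * h
        w = math.exp(theta * s) * h
        z += w
        m1 += s * w
        m2 += s * s * w
    return m1 / z, m2 / z

n, theta = 5, 1.3            # hypothetical sample size and parameter value
e1, e2 = moments(theta)

# Hessian blocks of M at (theta, nu_star(theta)); f is linear in theta.
H_aa = -n * e2               # d2M / dtheta2
H_ab = -n * e1               # d2M / dtheta dnu
H_bb = -float(n)             # d2M / dnu2

# Inverse covariance for theta from M: minus the Schur complement of H_bb.
prec_M = -(H_aa - H_ab * H_ab / H_bb)

# Inverse covariance from L: n * phi''(theta), by central finite differences.
eps = 1e-3
prec_L = n * (phi(theta + eps) - 2.0 * phi(theta) + phi(theta - eps)) / eps ** 2

print(prec_M, prec_L)        # should agree to several decimal places
```

Both quantities equal $n$ times the variance of $s$ under the normalised density, which is the identity $\mathbf {C}_{\mathcal {M}}^{-1}=\mathbf {C}_{\mathcal {L}}^{-1}$ specialised to this example.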

1.2 Preservation of log-concavity in exponential families

In exponential families, the log-likelihood is concave, which facilitates inference. The Poisson transform preserves this log-concavity.

In the natural parameterisation, exponential family models have $f_{\varvec{\theta }}(\mathbf {y})=s(\mathbf {y})^{t}\varvec{\theta }$, so that the log-likelihood is

$$\begin{aligned} \mathcal {L}\left( \varvec{\theta }\right) =\sum _{i=1}^{n}s(\mathbf {y}_{i})^{t}\varvec{\theta }-n\phi \left( \varvec{\theta }\right) \end{aligned}$$

with $s(\mathbf {y})$ a vector of sufficient statistics. The second derivative of $\mathcal {L}\left( \varvec{\theta }\right) $ simplifies to

$$\begin{aligned} -\frac{1}{n}\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {L}\left( \varvec{\theta }\right)= & {} E_{\varvec{\theta }}\left\{ s(\mathbf {y})s(\mathbf {y})^{t}\right\} -E_{\varvec{\theta }}\left\{ s(\mathbf {y})\right\} E_{\varvec{\theta }}\left\{ s(\mathbf {y})\right\} ^{t}\\= & {} \mathrm{Cov}_{\varvec{\theta }}\left\{ s(\mathbf {y})\right\} \end{aligned}$$

a p.s.d. matrix, which establishes concavity.

The second derivatives of $\mathcal {M}\left( \varvec{\theta },\nu \right) $ (Section 1) also simplify

$$\begin{aligned} -\frac{1}{n}\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right)= & {} \exp \left\{ \nu -\nu ^{\star }\left( \varvec{\theta }\right) \right\} \\&\times \int s(\mathbf {y})s(\mathbf {y})^{t}\exp \left( s(\mathbf {y})^{t}\varvec{\theta }-\phi \left( \varvec{\theta }\right) \right) \\ -\frac{1}{n}\frac{\partial ^{2}}{\partial \nu \partial \varvec{\theta }}\mathcal {M}\left( \varvec{\theta },\nu \right)= & {} \exp \left\{ \nu -\nu ^{\star }\left( \varvec{\theta }\right) \right\} \\&\times \int s(\mathbf {y})\exp \left( s(\mathbf {y})^{t}\varvec{\theta }-\phi \left( \varvec{\theta }\right) \right) \\ -\frac{1}{n}\frac{\partial ^{2}}{\partial \nu {}^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right)= & {} \exp \left\{ \nu -\nu ^{\star }\left( \varvec{\theta }\right) \right\} \end{aligned}$$

so that the full Hessian $\mathbf {H}$ can be written in block form as follows:

$$\begin{aligned} -\frac{1}{n}\exp \left\{ \nu ^{\star }\left( \varvec{\theta }\right) -\nu \right\} \mathbf {H}=\left[ \begin{array}{cc} E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) s\left( \mathbf {y}\right) ^{t}\right\} &{} E\left( s\left( \mathbf {y}\right) \right) \\ E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) ^{t}\right\} &{} 1 \end{array}\right] \!=\!\mathbf {A} \end{aligned}$$

and $\mathbf {H}$ is n.s.d if and only if for all $\mathbf {x},c$ such that $({\mathbf {x}},c)\ne {\mathbf {0}}$:

$$\begin{aligned} \left[ \begin{array}{cc} \mathbf {x}^{t}&c\end{array}\right] \mathbf {A}\left[ \begin{array}{c} \mathbf {x}\\ c \end{array}\right] >0 \end{aligned}$$

which the following establishes:

$$\begin{aligned}&\left[ \begin{array}{cc} \mathbf {x}^{t}&c\end{array}\right] \left[ \begin{array}{cc} E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) s\left( \mathbf {y}\right) ^{t}\right\} &{} E\left\{ s\left( \mathbf {y}\right) \right\} \\ E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) ^{t}\right\} &{} 1 \end{array}\right] \left[ \begin{array}{c} \mathbf {x}\\ c \end{array}\right] \\&\quad = \left[ \begin{array}{cc} \mathbf {x}^{t}&c\end{array}\right] \left[ \begin{array}{c} E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) s\left( \mathbf {y}\right) ^{t}\right\} \mathbf {x}+cE_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) \right\} \\ E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) ^{t}\right\} \mathbf {x}+c \end{array}\right] \\&\quad = E_{\varvec{\theta }}\left\{ \mathbf {x}^{t}s\left( \mathbf {y}\right) s\left( \mathbf {y}\right) ^{t}\mathbf {x}\right\} +2E_{\varvec{\theta }}\left\{ \mathbf {x}^{t}s\left( \mathbf {y}\right) \right\} c+c^{2}\\&\quad = E_{\varvec{\theta }}\left[ \left( s\left( \mathbf {y}\right) ^{t}\mathbf {x}+c\right) ^{2}\right] >0 \end{aligned}$$

assuming $E_{\varvec{\theta }}\left\{ s(\mathbf {y})s(\mathbf {y})^{t}\right\} $ is p.s.d. for all $\varvec{\theta }$.
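
The completing-the-square step can be verified numerically. The sketch below assumes a scalar sufficient statistic $s(y)=y$ on $\Omega =[0,1]$ (a toy choice, not from the article). The helper names `expect`, `quad_form` and `mean_square` and the test pairs are hypothetical. It evaluates the quadratic form in $\mathbf {A}$ and the expectation $E_{\theta }[(s\,x+c)^{2}]$ with the same midpoint quadrature and compares them.

```python
import math

def expect(g, theta, grid=20000):
    """Midpoint-rule expectation of g(s) under the density exp(theta*s)/Z on [0, 1]."""
    h = 1.0 / grid
    num = den = 0.0
    for k in range(grid):
        s = (k + 0.5) * h
        w = math.exp(theta * s)
        num += g(s) * w
        den += w
    return num / den

theta = 1.3                                # arbitrary parameter value
e1 = expect(lambda s: s, theta)            # E[s]
e2 = expect(lambda s: s * s, theta)        # E[s^2]

def quad_form(x, c):
    """[x c] A [x c]^t with A = [[E s^2, E s], [E s, 1]]."""
    return e2 * x * x + 2.0 * e1 * x * c + c * c

def mean_square(x, c):
    """E[(s * x + c)^2], the last line of the derivation."""
    return expect(lambda s: (s * x + c) ** 2, theta)

pairs = [(1.0, 0.0), (0.0, 1.0), (1.0, -0.6), (-2.0, 1.5)]
results = [(quad_form(x, c), mean_square(x, c)) for x, c in pairs]
for (x, c), (a, b) in zip(pairs, results):
    print(x, c, a, b)                      # the two columns should coincide
```

In this example the quadratic form is strictly positive for every non-zero $(x,c)$ tried, because $s\,x+c$ cannot vanish almost everywhere when $s$ has a non-degenerate distribution.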

1.3 Noise-contrastive divergence approximates the Poisson transform (Theorem 7)

We have assumed that

$$\begin{aligned} f_{\varvec{\theta }}(\mathbf {y})-\log q(\mathbf {y})\le C(\varvec{\theta }) \end{aligned}$$

for a certain constant $C(\varvec{\theta })$ that may depend on $\varvec{\theta }$, and all $\mathbf {y}\in \Omega $. We rewrite the log odds ratio as $h(\mathbf {y})-\log (m)$ where

$$\begin{aligned} h(\mathbf {y}):=f_{\varvec{\theta }}(\mathbf {y})+\nu -\log q(\mathbf {y})+\log (n) \end{aligned}$$

does not depend on m; note $h(\mathbf {y})\le \bar{h}:=C(\varvec{\theta })+\nu +\log (n)$. One has

$$\begin{aligned}&{\mathcal {R}}^{m}(\varvec{\theta },\nu )+\log (m/n)= \sum _{i=1}^{n}\log \left[ \frac{m\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} }{n\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} +mq(\mathbf {y}_{i})}\right] \\&\quad +\sum _{j=1}^{m}\log \left[ \frac{mq({\mathbf {r}}_{j})}{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} +mq({\mathbf {r}}_{j})}\right] \end{aligned}$$

where the first term trivially converges (as $m\rightarrow +\infty $) to

$$\begin{aligned} \sum _{i=1}^{n}\left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu -\log q(\mathbf {y}_{i})\right\} . \end{aligned}$$

Regarding the second term, one has

$$\begin{aligned}&\log \left[ \frac{mq({\mathbf {r}}_{j})}{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} +mq({\mathbf {r}}_{j})}\right] \\&\quad =\log \left[ 1-\frac{1}{1+m\exp \left\{ -h({\mathbf {r}}_{j})\right\} }\right] \end{aligned}$$

where

$$\begin{aligned} 0\le \frac{1}{1+m\exp \left\{ -h({\mathbf {r}}_{j})\right\} }\le \frac{1}{m}\exp (\bar{h}). \end{aligned}$$

Since $\left| \log (1-x)+x\right| \le x^{2}$ for $x\in [0,1/2]$, we have, for m large enough, that

$$\begin{aligned}&\left| \log \left[ \frac{mq({\mathbf {r}}_{j})}{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} +mq({\mathbf {r}}_{j})}\right] \right. \nonumber \\&\quad \left. +\frac{1}{1+m\exp \left\{ -h({\mathbf {r}}_{j})\right\} }\right| \le \frac{\exp (2\bar{h})}{m^{2}} \end{aligned}$$

(5.1)

and

$$\begin{aligned} \left| \frac{1}{1+m\exp \left\{ -h({\mathbf {r}}_{j})\right\} }-\frac{1}{m}\exp \left\{ h({\mathbf {r}}_{j})\right\} \right| \le \frac{\exp (2\bar{h})}{m^{2}} \end{aligned}$$

and since, by the law of large numbers,

$$\begin{aligned}&\frac{1}{m}\sum _{j=1}^{m}\exp \left\{ h({\mathbf {r}}_{j})\right\} \rightarrow \mathbb {E}_{q}[\exp \left\{ h({\mathbf {r}}_{j})\right\} ]\nonumber \\&\quad =n\int \exp \left\{ f_{\varvec{\theta }}(\mathbf {y})+\nu \right\} \, d\mathbf {y}<+\infty \end{aligned}$$

(5.2)

almost surely as $m\rightarrow +\infty $, one also has

$$\begin{aligned}&\sum _{j=1}^{m}\log \left[ \frac{mq({\mathbf {r}}_{j})}{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} +mq({\mathbf {r}}_{j})}\right] \rightarrow \\&\quad -n\int \exp \left\{ f_{\varvec{\theta }}(\mathbf {y})+\nu \right\} \, d\mathbf {y}\end{aligned}$$

almost surely, since the difference between the two sums is bounded deterministically by $\exp (2\bar{h})/m$.
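
The convergence of the two sums can be illustrated by simulation. The sketch below assumes a toy setting that is not from the article: $\Omega =[0,1]$, $f_{\theta }(s)=\theta s$ and a uniform reference density $q\equiv 1$, so that the boundedness assumption holds. The data, the values of `theta` and `nu`, the seed and the function names are all illustrative. It evaluates the data sum and the reference-point sum from the decomposition above for increasing $m$ and compares each with its limit.

```python
import math
import random

def nce_terms(theta, nu, ys, rs):
    """Data sum and reference-point sum from the decomposition, with q = 1."""
    n, m = len(ys), len(rs)
    q = 1.0   # uniform reference density on [0, 1] (assumption of this sketch)
    data = sum(
        math.log(m * math.exp(theta * y + nu) / (n * math.exp(theta * y + nu) + m * q))
        for y in ys
    )
    noise = sum(
        math.log(m * q / (n * math.exp(theta * r + nu) + m * q))
        for r in rs
    )
    return data, noise

def limit_terms(theta, nu, ys):
    """Limits as m grows: sum of (f + nu - log q), and -n * integral of exp(f + nu)."""
    n = len(ys)
    data = sum(theta * y + nu for y in ys)               # log q = 0 here
    noise = -n * math.exp(nu) * math.expm1(theta) / theta
    return data, noise

random.seed(0)
ys = [0.10, 0.40, 0.45, 0.80, 0.95]   # made-up observations
theta, nu = 1.3, -0.5                  # arbitrary parameter values
lim_data, lim_noise = limit_terms(theta, nu, ys)

errors = {}
for m in (100, 10_000, 200_000):
    rs = [random.random() for _ in range(m)]             # reference points from q
    data, noise = nce_terms(theta, nu, ys, rs)
    errors[m] = (abs(data - lim_data), abs(noise - lim_noise))
    print(m, errors[m])
```

The error in the data sum is deterministic and of order $1/m$, while the error in the reference-point sum is dominated by Monte Carlo fluctuations of order $m^{-1/2}$, in line with the law-of-large-numbers step (5.2).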

1.4 Uniform convergence of the noise-contrastive divergence (Theorem 8)

We first prove two intermediate results.

Lemma 9

Assuming that $\left| f_{\varvec{\theta }}(\mathbf {y})-\log q(\mathbf {y})\right| \le C$ for all $\mathbf {y}\in \Omega $, then there exists a bounded interval I such that, for any $\varvec{\theta }$, the maximum of both functions $\nu \rightarrow \mathcal {M}(\varvec{\theta },\nu )$ and $\nu \rightarrow {\mathcal {R}}^{m}(\varvec{\theta },\nu )$ is attained in I.

Proof

Let $\varvec{\theta }$ be some fixed value. $\mathcal {M}(\varvec{\theta },\nu )$ is maximised at $\nu ^{\star }(\varvec{\theta })=-\log \int _{\Omega }\exp \left\{ f_{\varvec{\theta }}(\mathbf {y})\right\} \, d\mathbf {y}\in \left[ -C,C\right] $, since $e^{-C}q\le \exp (f_{\varvec{\theta }})\le e^{C}q$. For ${\mathcal {R}}^{m}(\varvec{\theta },\nu )$, using again $e^{-C}q\le \exp (f_{\varvec{\theta }})\le e^{C}q$, one sees that $l(\nu )\le {\mathcal {R}}^{m}(\varvec{\theta },\nu )\le u(\nu )$, where l and u are functions of $\nu $ that diverge to $-\infty $ as both $\nu \rightarrow +\infty $ and $\nu \rightarrow -\infty $; i.e.

$$\begin{aligned}&{\mathcal {R}}^{m}(\varvec{\theta },\nu )+\log (m/n)\le u(\nu ):= \sum _{i=1}^{n}\log \left[ \frac{m\exp \left\{ C+\nu \right\} }{n\exp (-C+\nu )+m}\right] \\&\quad +\sum _{j=1}^{m}\log \left[ \frac{m}{n\exp \left\{ -C+\nu \right\} +m}\right] \end{aligned}$$

and the lower bound $l(\nu )$ has a similar expression. Thus, one may construct an interval J such that the maximum of the function $\nu \rightarrow {\mathcal {R}}^{m}(\varvec{\theta },\nu )$ is attained in J for all $\varvec{\theta }$ (e.g. take J such that for $\nu \in J^{c}, u(\nu )\le M_{l}/2, l(\nu )\le M_{l}/2$, with $M_{l}=\sup _{\nu }l$). To conclude, take $I=J\cup [-C,C]$.

We now establish uniform convergence, but, in light of the previous result, we restrict $\nu $ to the interval I defined in Lemma 9.

Lemma 10

Under the Assumptions that (i) $\Theta $ is bounded, that (ii) $\left| f_{\varvec{\theta }}(\mathbf {y})-\log q(\mathbf {y})\right| \le C$ for all $\mathbf {y}\in \Omega $, that (iii) $\left| f_{\varvec{\theta }}(\mathbf {y})-f_{\varvec{\theta }'}(\mathbf {y})\right| \le \kappa (\mathbf {y})\left\| \varvec{\theta }-\varvec{\theta }'\right\| $ with $\kappa $ such that $\mathbb {E}_{q}[\kappa ]<\infty $, one has, for fixed ${\mathcal {S}}=\left\{ \mathbf {y}_{1},\ldots ,\mathbf {y}_{n}\right\} $:

$$\begin{aligned}&\sup _{\left( \varvec{\theta },\nu \right) \in \Theta \times I}\left| {\mathcal {R}}^{m}(\varvec{\theta },\nu )+\log (m/n)+\sum _{i=1}^{n}\log q({\mathbf {y}}_{i})\right. \nonumber \\&\left. \quad -\mathcal {M}(\varvec{\theta },\nu )\right| \rightarrow 0 \end{aligned}$$

(5.3)

almost surely, relative to the randomness induced by ${\mathcal {R}}=\left\{ {\mathbf {r}}_{1},\ldots ,{\mathbf {r}}_{m}\right\} .$

Proof

Recall that the absolute difference above was bounded by the sum of three terms in the previous Appendix. The first term was

$$\begin{aligned}&\sum _{i=1}^{n}\left[ \log \left[ \frac{m\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} }{n\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} +mq(\mathbf {y}_{i})}\right] \right. \\&\quad \left. -\left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu -\log q(\mathbf {y}_{i})\right\} \right] \end{aligned}$$

which clearly converges deterministically to 0 as $m\rightarrow +\infty .$ In addition, this convergence is uniform with respect to $\left( \varvec{\theta },\nu \right) \in \Theta \times I$, since $\left| \log x-\log y\right| \le c\left| x-y\right| $ for $x,y\ge 1/c$, and here, by Assumption (ii),

$$\begin{aligned} x:= & {} \frac{m\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} }{n\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} +mq(\mathbf {y}_{i})}\ge \frac{m\exp \left\{ -C+\nu \right\} }{n\exp \left\{ C+\nu \right\} +m}\\\ge & {} \frac{\exp \left\{ -C+\nu \right\} }{n\exp \left\{ C+\nu \right\} +1}\quad (m\ge 1) \end{aligned}$$

and $y=\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu -\log q(\mathbf {y}_{i})\right\} \ge \exp \left\{ -C+\nu \right\} $, so both x and y are lower bounded since $\nu \in I$. Similarly $(x-y)$ is bounded by $C'/m$, where $C'$ is some constant independent of $\varvec{\theta }$.

The second term, see (5.1), was bounded by $\exp (2\bar{h})/m^{2}$, where $\bar{h}$, an upper bound of h, may now be replaced by a constant, since $h(\mathbf {y}):=f_{\varvec{\theta }}(\mathbf {y})+\nu -\log q(\mathbf {y})+\log (n)\le C+\nu +\log (n)$ and $\nu \in I$, again by Assumption (ii).

The third term is related to the law of large numbers (5.2) for the random variable $H_{(\varvec{\theta },\nu )}({\mathbf {r}}_{i}):=\exp \left\{ h({\mathbf {r}}_{i})\right\} $, which depends implicitly on $\left( \varvec{\theta },\nu \right) $:

$$\begin{aligned} H_{(\varvec{\theta },\nu )}({\mathbf {r}}_{i})=\frac{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{i})+\nu \right\} }{q({\mathbf {r}}_{i})}. \end{aligned}$$

To obtain (almost surely) uniform convergence, we use the generalised version of the Glivenko-Cantelli theorem; e.g. Theorem 19.4 p. 270 in Van der Vaart (2000). From Example 19.7 of the same book, one sees that a sufficient condition in our case is that $\Theta $ is bounded [Assumption (i)], and that

$$\begin{aligned} \left| H_{(\varvec{\theta },\nu )}({\mathbf {r}})-H_{(\varvec{\theta }',\nu ')}({\mathbf {r}})\right| \le m({\mathbf {r}})\left\| \varvec{\xi }-\varvec{\xi }'\right\| \end{aligned}$$

for $\varvec{\xi }=(\varvec{\theta },\nu ), \varvec{\xi }'=(\varvec{\theta }',\nu ')$, and m a function such that $\mathbb {E}_{q}[m]<\infty $. But

$$\begin{aligned}&\left| H_{(\varvec{\theta },\nu )}({\mathbf {r}})-H_{(\varvec{\theta }',\nu ')}({\mathbf {r}})\right| =\frac{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}})+\nu \right\} }{q({\mathbf {r}})}\\&\quad \times \left| 1-\exp \left\{ f_{\varvec{\theta }'}({\mathbf {r}})+\nu '-f_{\varvec{\theta }}({\mathbf {r}})-\nu \right\} \right| \\&\quad \le ne^{C+\nu }\left| 1-\exp \left\{ f_{\varvec{\theta }'}({\mathbf {r}})+\nu '-f_{\varvec{\theta }}({\mathbf {r}})-\nu \right\} \right| \\&\quad \le C'\left\{ \kappa ({\mathbf {r}})\left\| \varvec{\theta }-\varvec{\theta }'\right\| +\left| \nu -\nu '\right| \right\} \\&\quad \le C'\left\{ \kappa ({\mathbf {r}})+1\right\} \left\| \varvec{\xi }-\varvec{\xi }'\right\| \end{aligned}$$

by Assumption (ii), and for some constant $C'$ independent of $\varvec{\theta }$, since $\left| 1-e^{x}\right| \le K\left| x\right| $ for x in a bounded set. One may conclude, since, by Assumption (iii), $\mathbb {E}_{q}[\kappa ]<\infty $.

We are now able to prove Theorem 8. Again, let $\varvec{\xi }=(\varvec{\theta },\nu )$, and rewrite any function of $(\varvec{\theta },\nu )$ as a function of $\varvec{\xi }$, i.e. $\mathcal {M}(\varvec{\xi }), {\mathcal {R}}^{m}(\varvec{\xi })$. By e.g. Theorem 5.7 p. 45 of Van der Vaart (2000), the uniform convergence (5.3) implies that the maximiser $\hat{\varvec{\xi }}^{m}$ of ${\mathcal {R}}^{m}(\varvec{\theta },\nu )$ converges to the maximiser $\hat{\varvec{\xi }}$ of $\mathcal {M}(\varvec{\theta },\nu )$, provided that (a) the maximisation is with respect to $\left( \varvec{\theta },\nu \right) \in \Theta \times I$; and (b) that $\sup _{d(\varvec{\xi },\hat{\varvec{\xi }})\ge \epsilon }\mathcal {M}(\varvec{\xi })<\mathcal {M}(\hat{\varvec{\xi }})$. However, by Lemma 9 one sees that in (a) the same estimators would be obtained by maximising instead with respect to $\left( \varvec{\theta },\nu \right) \in \Theta \times \mathbb {R}$, and (b) is a direct consequence of Assumption (iv) of the theorem, if one takes for $d(\varvec{\xi },\hat{\varvec{\xi }})$ the supremum norm of $\varvec{\xi }-\hat{\varvec{\xi }}$.
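
The convergence of maximisers can be illustrated on simulated data. The sketch below is not the authors' procedure. It assumes a toy model ($\Omega =[0,1]$, $f_{\theta }(s)=\theta s$, uniform reference density $q\equiv 1$) and takes ${\mathcal {R}}^{m}$ to be the logistic log-likelihood whose log odds are $h(\mathbf {y})-\log (m)$, as the rewrite of the log odds ratio in the previous section suggests. The Newton solver, the bisection-based maximum likelihood fit, the sample sizes and the seed are all choices made here.

```python
import math
import random

def fit_logistic(ys, rs, iters=25):
    """Newton ascent on the logistic objective in (theta, nu), with q = 1."""
    n, m = len(ys), len(rs)
    offset = math.log(n / m)              # log(n) - log(m) - log q, log q = 0
    theta, nu = 0.0, 0.0
    pts = [(y, 1.0) for y in ys] + [(r, 0.0) for r in rs]   # (location, label)
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, label in pts:
            p = 1.0 / (1.0 + math.exp(-(theta * x + nu + offset)))
            g0 += (label - p) * x         # gradient with respect to theta
            g1 += label - p               # gradient with respect to nu
            w = p * (1.0 - p)
            h00 += w * x * x              # entries of minus the Hessian
            h01 += w * x
            h11 += w
        det = h00 * h11 - h01 * h01
        theta += (h11 * g0 - h01 * g1) / det
        nu += (h00 * g1 - h01 * g0) / det
    return theta, nu

def mean_stat(theta):
    """E_theta[s] = phi'(theta) for the density exp(theta*s - phi) on [0, 1]."""
    return math.exp(theta) / math.expm1(theta) - 1.0 / theta

def fit_mle(ys, lo=1e-3, hi=20.0):
    """Maximum likelihood by bisection on the moment equation mean(y) = phi'(theta)."""
    target = sum(ys) / len(ys)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mean_stat(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(1)
theta0, n, m = 1.3, 500, 5000             # hypothetical truth and sample sizes
# Inverse-CDF sampling from the density proportional to exp(theta0 * s) on [0, 1].
ys = [math.log1p(random.random() * math.expm1(theta0)) / theta0 for _ in range(n)]
rs = [random.random() for _ in range(m)]  # reference points drawn from q

theta_nce, nu_nce = fit_logistic(ys, rs)
theta_mle = fit_mle(ys)
nu_mle = -math.log(math.expm1(theta_mle) / theta_mle)   # nu_star at the MLE

print(theta_nce, theta_mle)
print(nu_nce, nu_mle)
```

With $m$ ten times larger than $n$, the logistic estimates should already lie close to the maximum likelihood estimate and to the corresponding $\nu ^{\star }$; the remaining gap is Monte Carlo error from the reference points and shrinks as $m$ grows.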

Additional information on the application

In our application we fit a spatial Markov chain model using logistic regression. Since the procedure involves the generation of a random set of reference points, we incur some Monte Carlo error in the estimates. Estimating the magnitude of the Monte Carlo error is just a matter of running the procedure several times to look at variability in the estimates. We did so over five repetitions and report the results in Fig. 5. For each repetition we plot the estimated smooth effect of saccade angle $r_{\mathrm{ang}}$, along with a 95% confidence band. Since smoothing splines are used, smoothing hyperparameters had to be inferred from the data (using REML, Wood 2011), and the reported confidence band is conditional on the estimated value of the smoothing hyperparameters. The fits and confidence bands are extremely stable over independent repetitions. The R command we used was

About this article

Cite this article

Barthelmé, S., Chopin, N. The Poisson transform for unnormalised statistical models. Stat Comput 25, 767–780 (2015). https://doi.org/10.1007/s11222-015-9559-4

Accepted: 08 March 2015
Published: 11 June 2015
Issue Date: July 2015
DOI: https://doi.org/10.1007/s11222-015-9559-4
