The Poisson transform for unnormalised statistical models


Unlike standard statistical models, unnormalised statistical models only specify the likelihood function up to a constant. While such models are natural and popular, the lack of normalisation makes inference much more difficult. Extending classical results on the multinomial-Poisson transform (Baker, J. Royal Stat. Soc. 43(4):495–504, 1994), we show that inferring the parameters of an unnormalised model on a space \(\Omega \) can be mapped onto an equivalent problem of estimating the intensity of a Poisson point process on \(\Omega \). The unnormalised statistical model now specifies an intensity function that does not need to be normalised. Effectively, the normalisation constant may now be inferred as just another parameter, at no loss of information. The result can be extended to cover non-IID models, including, for example, unnormalised models for sequences of graphs (dynamical graphs) or for sequences of binary vectors. As a consequence, we prove that unnormalised parametric inference in non-IID models can be turned into a semi-parametric estimation problem. Moreover, we show that the noise-contrastive estimation method of Gutmann and Hyvärinen (J. Mach. Learn. Res. 13(1):307–361, 2012) can be understood as an approximation of the Poisson transform, and extended to non-IID settings. We use our results to fit spatial Markov chain models of eye movements, where the Poisson transform allows us to turn a highly non-standard model into vanilla semi-parametric logistic regression.




  1. Baddeley, A., Berman, M., Fisher, N.I., Hardegen, A., Milne, R.K., Schuhmacher, D., Shah, R., Turner, R.: Spatial logistic regression and change-of-support in Poisson point processes. Electron. J. Stat. 4, 1151–1201 (2010)


  2. Baddeley, A., Coeurjolly, J.-F., Rubak, E., Waagepetersen, R.: Logistic regression for spatial Gibbs point processes. Biometrika 101(2), 377–392 (2014)

  3. Baker, S.G.: The multinomial-Poisson transformation. J. Royal Stat. Soc. 43(4), 495–504 (1994)


  4. Barthelmé, S., Trukenbrod, H., Engbert, R., Wichmann, F.: Modeling fixation locations using spatial point processes. J. Vis. 13(12), 1 (2013)


  5. Bengio, Y., Delalleau, O.: Justifying and generalizing contrastive divergence. Neural Comput. 21(6), 1601–1621 (2009)


  6. Caimo, A., Friel, N.: Bayesian inference for exponential random graph models. Soc. Netw. 33(1), 41–55 (2011)


  7. Engbert, R., Trukenbrod, H.A., Barthelmé, S., Wichmann, F.A.: Spatial statistics and attentional dynamics in scene viewing. J. Vis. 15(1), 14 (2014)


  8. Foulsham, T., Kingstone, A., Underwood, G.: Turning the world around: patterns in saccade direction vary with picture orientation. Vis. Res. 48(17), 1777–1790 (2008)


  9. Geyer, C.J.: Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report 568, School of Statistics, University of Minnesota (1994)

  10. Girolami, M., Lyne, A.-M., Strathmann, H., Simpson, D., Atchadé, Y.: Playing Russian roulette with intractable likelihoods. arXiv:1306.4032 (2013)

  11. Gu, M.G., Zhu, H.-T.: Maximum likelihood estimation for spatial models by Markov chain Monte Carlo stochastic approximation. J. Royal Stat. Soc. 63(2), 339–355 (2001)


  12. Gutmann, M., Hirayama, J.: Bregman divergence as general framework to estimate unnormalized statistical models. In: Cozman, F.G., Pfeffer, A. (eds.) Uncertainty in Artificial Intelligence, pp. 283–290. AUAI Press, Barcelona (2011)


  13. Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13(1), 307–361 (2012)


  14. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning, corrected edn. Springer, New York (2003)


  15. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)


  16. Kienzle, W., Franz, M.O., Schölkopf, B., Wichmann, F.A.: Center-surround patterns emerge as optimal predictors for human saccade targets. J. Vis. 9(5), 7 (2009)


  17. Kingman, J.F.C.: Poisson Processes (Oxford Studies in Probability). Oxford University Press, London (1993)


  18. Li, P., König, A.C.: Theory and applications of b-bit minwise hashing. Commun. ACM 54(8), 101–109 (2011)


  19. Minka, T.: Divergence Measures and Message Passing. Technical report, Microsoft Research (2005)

  20. Mnih, A., Kavukcuoglu, K.: Learning word embeddings efficiently with noise-contrastive estimation. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 26, pp. 2265–2273. Curran Associates Inc, Red Hook (2013)


  21. Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1751–1758 (2012a)

  22. Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, 26 June–1 July 2012 (2012b)

  23. Møller, J., Pettitt, A.N., Reeves, R., Berthelsen, K.K.: An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458 (2006)

  24. Murray, I., Ghahramani, Z., MacKay, D.: MCMC for doubly-intractable distributions. In: Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI) (2006)

  25. Pihlaja, M., Gutmann, M., Hyvärinen, A.: A family of computationally efficient and simple estimators for unnormalized statistical models. In: UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, 8–11 July 2010, pp. 442–449 (2010)

  26. Salakhutdinov, R., Hinton, G.E.: Deep Boltzmann machines. In: International Conference on Artificial Intelligence and Statistics, pp. 448–455 (2009)

  27. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 1st edn. The MIT Press, Cambridge (2001)


  28. Tatler, B., Vincent, B.: The prominence of behavioural biases in eye guidance. Vis. Cogn. 17(6), 1029–1054 (2009)


  29. Van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (2000)

  30. Walker, S.G.: Posterior sampling when the normalizing constant is unknown. Commun. Stat. 40(5), 784–792 (2011)


  31. Wang, C., Komodakis, N., Paragios, N.: Markov random field modeling, inference & learning in computer vision & image understanding: a survey. Comput. Vis. Image Underst. 117(11), 1610–1627 (2013)


  32. Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688 (2011)

  33. Wood, S.: Generalized Additive Models: An Introduction with R (Chapman & Hall/CRC Texts in Statistical Science), 1st edn. Chapman and Hall/CRC, Boca Raton (2006)


  34. Wood, S.N.: Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J. Royal Stat. Soc. 73(1), 3–36 (2011)



Author information



Corresponding author

Correspondence to Nicolas Chopin.


Derivatives of Poisson-transformed likelihoods

The first and second derivatives of \(\mathcal {L}\left( \varvec{\theta }\right) \) and \(\mathcal {M}\left( \varvec{\theta },\nu \right) \) are needed in the proofs and we collect them here.

Derivatives of \(\mathcal {L}\left( \varvec{\theta }\right) \):

$$\begin{aligned} \mathcal {L}(\varvec{\theta })&= \sum _{i=1}^{n}f_{\varvec{\theta }}(\mathbf {y}_{i})-n\log \left( \int \exp \left\{ f_{\varvec{\theta }}\left( s\right) \right\} \,\mathrm {d}s\right) \\&:=\sum _{i=1}^{n}f_{\varvec{\theta }}(\mathbf {y}_{i})-n\phi \left( \varvec{\theta }\right) \\ \frac{\partial }{\partial \varvec{\theta }}\phi \left( \varvec{\theta }\right)&= \int \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}(s)\exp \left\{ f_{\varvec{\theta }}\left( s\right) -\phi \left( \varvec{\theta }\right) \right\} \,\mathrm {d}s=E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \\ \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\phi \left( \varvec{\theta }\right)&= E_{\varvec{\theta }}\left( \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\right) +E_{\varvec{\theta }}\left( \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) \\&\quad -E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\\ \frac{\partial }{\partial \varvec{\theta }}\mathcal {L}\left( \varvec{\theta }\right)&= \sum _{i=1}^{n}\frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}(\mathbf {y}_{i})-n\frac{\partial }{\partial \varvec{\theta }}\phi \left( \varvec{\theta }\right) \\ \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {L}\left( \varvec{\theta }\right)&= \sum _{i=1}^{n}\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}(\mathbf {y}_{i})-n\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\phi \left( \varvec{\theta }\right) \end{aligned}$$

where we have used \(E_{\varvec{\theta }}\) as shorthand for the expectation with respect to the density \(\exp \left\{ f_{\varvec{\theta }}\left( s\right) -\phi \left( \varvec{\theta }\right) \right\} \).
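These identities are easy to sanity-check numerically. The sketch below uses a toy model \(f_{\varvec{\theta }}(y)=-\theta y^{2}\) truncated to \([-10,10]\) (our choice for illustration, not from the paper; the truncation error is negligible here) and compares finite differences of \(\phi \) against the two expectation formulas:

```python
import numpy as np

# Toy model f_theta(y) = -theta * y^2 on a truncated Omega = [-10, 10].
grid = np.linspace(-10.0, 10.0, 200001)
w = np.gradient(grid)                    # quadrature weights (grid spacing)

def phi(theta):
    # phi(theta) = log int exp{ f_theta(s) } ds
    return np.log((w * np.exp(-theta * grid**2)).sum())

def E(theta, g):
    # expectation of g under the density exp{ f_theta(s) - phi(theta) }
    dens = np.exp(-theta * grid**2)
    return (w * dens * g).sum() / (w * dens).sum()

theta = 1.3
df = -grid**2                            # d f_theta / d theta on the grid

fd1 = (phi(theta + 1e-6) - phi(theta - 1e-6)) / 2e-6
fd2 = (phi(theta + 1e-4) - 2 * phi(theta) + phi(theta - 1e-4)) / 1e-8

print(fd1, E(theta, df))                          # phi'  = E(df/dtheta)
print(fd2, E(theta, df**2) - E(theta, df)**2)     # phi'' = Var(df/dtheta)
```

For this model \(\phi (\theta )=\tfrac{1}{2}\log (\pi /\theta )\), so both checks can also be verified in closed form.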

Derivatives of \(\mathcal {M}\left( \varvec{\theta },\nu \right) \):

$$\begin{aligned} \mathcal {M}\left( \varvec{\theta },\nu \right)&= \sum _{i=1}^{n}\left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} -n\int \exp \left\{ f_{\varvec{\theta }}\left( s\right) +\nu \right\} \,\mathrm {d}s\\ \frac{\partial }{\partial \varvec{\theta }}\mathcal {M}\left( \varvec{\theta },\nu \right)&= \sum _{i=1}^{n}\frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}(\mathbf {y}_{i})-nE_{\varvec{\theta },\nu }\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \\ \frac{\partial }{\partial \nu }\mathcal {M}\left( \varvec{\theta },\nu \right)&= n-n\int \exp \left\{ f_{\varvec{\theta }}\left( s\right) +\nu \right\} \,\mathrm {d}s\\ \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right)&= \sum _{i=1}^{n}\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\left( \mathbf {y}_{i}\right) -n\left( E_{\varvec{\theta },\nu }\left( \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\right) \right. \\&\quad \left. +\,E_{\varvec{\theta },\nu }\left( \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) \right) \\ \frac{\partial ^{2}}{\partial \nu ^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right)&= -n\int \exp \left\{ f_{\varvec{\theta }}\left( s\right) +\nu \right\} \,\mathrm {d}s\\ \frac{\partial ^{2}}{\partial \varvec{\theta }\partial \nu }\mathcal {M}\left( \varvec{\theta },\nu \right)&= -nE_{\varvec{\theta },\nu }\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \end{aligned}$$

where we have used \(E_{\varvec{\theta },\nu }\) as shorthand for the linear operator \(E_{\varvec{\theta },\nu }(\varphi )=\int \varphi (s)\exp \left\{ f_{\varvec{\theta }}\left( s\right) +\nu \right\} \,\mathrm {d}s\) (which is not an expectation in general).
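A useful consequence of these derivatives: for fixed \(\varvec{\theta }\), solving \(\partial \mathcal {M}/\partial \nu =0\) gives \(\nu ^{\star }(\varvec{\theta })=-\phi (\varvec{\theta })\), and at that point \(\mathcal {M}(\varvec{\theta },\nu ^{\star })=\mathcal {L}(\varvec{\theta })-n\): profiling out \(\nu \) recovers the log-likelihood up to a constant. A quick numerical check on a toy Gaussian-shaped model (our choice, not from the paper), where the integral is available in closed form:

```python
import numpy as np

# Toy unnormalised model f_theta(y) = -theta * y^2 on Omega = R,
# so int exp{f_theta} dy = sqrt(pi/theta) and phi(theta) = log sqrt(pi/theta).
theta = 1.5
y = np.array([0.3, -0.8, 1.1, 0.05, -0.4])   # arbitrary "observations"
n = len(y)

f = -theta * y**2
phi = 0.5 * np.log(np.pi / theta)
L = f.sum() - n * phi                        # log-likelihood L(theta)

def M(nu):
    # Poisson-transform objective M(theta, nu) for fixed theta
    return f.sum() + n * nu - n * np.exp(nu) * np.sqrt(np.pi / theta)

nus = np.linspace(-5.0, 5.0, 200001)
vals = M(nus)
nu_hat = nus[np.argmax(vals)]

print(nu_hat, -phi)          # maximiser sits at nu* = -phi(theta)
print(vals.max(), L - n)     # and max_nu M(theta, nu) = L(theta) - n
```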

Further properties of the Poisson transform

The Poisson transform preserves confidence intervals

The usual method for obtaining confidence intervals for \(\varvec{\theta }\) is to invert the Hessian matrix of \(\mathcal {L}\left( \varvec{\theta }\right) \) at the mode, \(\varvec{\theta }^{\star }\):

$$\begin{aligned} \mathbf {C}_{\mathcal {L}}=\left( -\left. \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {L}\right| _{\varvec{\theta }=\varvec{\theta }^{\star }}\right) ^{-1} \end{aligned}$$

We can show that the same confidence intervals can be obtained from \(\mathcal {M}\left( \varvec{\theta },\nu \right) \) at the joint mode, \(\varvec{\theta }^{\star },\nu ^{\star }\).

At the joint maximum, \(\nu ^{\star }\) normalises the intensity function, and the Hessian of \(\mathcal {M}\) equals

$$\begin{aligned} H= & {} \left[ \begin{array}{cc} \mathbf {H}_{aa} &{} \mathbf {H}_{ba}\\ \mathbf {H}_{ab} &{} \mathbf {H}_{bb} \end{array}\right] =\left[ \begin{array}{cc} \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right) &{} \frac{\partial ^{2}}{\partial \varvec{\theta }\partial \nu }\mathcal {M}\left( \varvec{\theta },\nu \right) \\ \frac{\partial ^{2}}{\partial \nu \partial \varvec{\theta }}\mathcal {M}\left( \varvec{\theta },\nu \right) &{} \frac{\partial ^{2}}{\partial \nu ^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right) \end{array}\right] \\= & {} \left[ \begin{array}{cc} \sum \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\left( \mathbf {y}_{i}\right) -nE_{\varvec{\theta }}\left( \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\right) -nE_{\varvec{\theta }}\left( \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) &{} -nE_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \\ -nE_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t} &{} -n \end{array}\right] \end{aligned}$$

where again \(E_{\varvec{\theta }}\) denotes the expectation with respect to the density \(\exp \left\{ f_{\varvec{\theta }}\left( s\right) -\phi \left( \varvec{\theta }\right) \right\} \).

Inverting \(-H\) also yields confidence intervals. By the inversion rule for block matrices, the approximate covariance for \(\varvec{\theta }\) using \(\mathcal {M}\left( \varvec{\theta },\nu \right) \) equals

$$\begin{aligned} \mathbf {C}_{\mathcal {M}}^{-1}= & {} -\left( \mathbf {H}_{aa}-\mathbf {H}_{ba}\mathbf {H}_{bb}^{-1}\mathbf {H}_{ab}\right) \\= & {} -\left( \mathbf {H}_{aa}+\frac{1}{n}n^{2}E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) \\= & {} -\Bigg [\sum \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\left( \mathbf {y}_{i}\right) -nE_{\varvec{\theta }}\left( \frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}\right) \\&-\,nE_{\varvec{\theta }}\left( \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) \left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\right) \\&+\,nE_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) E_{\varvec{\theta }}\left( \frac{\partial }{\partial \varvec{\theta }}f_{\varvec{\theta }}\right) ^{t}\Bigg ]\\= & {} \mathbf {C}_{\mathcal {L}}^{-1} \end{aligned}$$
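The block-matrix algebra above can be checked numerically. The following sketch uses a toy one-parameter exponential family on \([0,1]\) (our choice for illustration, not the paper's eye-movement model), builds both Hessians by quadrature, and confirms \(\mathbf {C}_{\mathcal {L}}=\mathbf {C}_{\mathcal {M}}\):

```python
import numpy as np

# Toy exponential-family model on Omega = [0, 1]: f_theta(y) = theta * y.
theta, n = 0.7, 50
grid = np.linspace(0.0, 1.0, 20001)
w = np.gradient(grid)                     # quadrature weights

dens = np.exp(theta * grid)
p = dens / (w * dens).sum()               # normalised density
Ey  = (w * p * grid).sum()
Ey2 = (w * p * grid**2).sum()

# Hessian of L(theta): d^2 L / d theta^2 = -n * Var_theta(y), so
C_L = 1.0 / (n * (Ey2 - Ey**2))

# Hessian of M(theta, nu) at the joint mode (nu = nu*, normalising the intensity)
H = np.array([[-n * Ey2, -n * Ey],
              [-n * Ey,  -n     ]])
C_M = np.linalg.inv(-H)[0, 0]             # (theta, theta) entry of (-H)^{-1}

print(C_L, C_M)                           # the two agree
```

This is exactly the Schur-complement cancellation in the derivation above: the \((\varvec{\theta },\nu )\) cross terms restore the centering term \(-nE_{\varvec{\theta }}(\partial f/\partial \varvec{\theta })E_{\varvec{\theta }}(\partial f/\partial \varvec{\theta })^{t}\).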

Preservation of log-concavity in exponential families

In exponential families, the log-likelihood is concave, which facilitates inference. The Poisson transform preserves this log-concavity.

In the natural parameterisation, the log-likelihood of an exponential family model is

$$\begin{aligned} \mathcal {L}\left( \varvec{\theta }\right) =\sum _{i=1}^{n}s(\mathbf {y}_{i})^{t}\varvec{\theta }-n\phi \left( \varvec{\theta }\right) \end{aligned}$$

with \(s(\mathbf {y})\) a vector of sufficient statistics. Since \(\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}f_{\varvec{\theta }}=0\), the second derivative of \(\mathcal {L}\left( \varvec{\theta }\right) \) simplifies to

$$\begin{aligned} -\frac{1}{n}\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {L}\left( \varvec{\theta }\right)= & {} E_{\varvec{\theta }}\left\{ s(\mathbf {y})s(\mathbf {y})^{t}\right\} -E_{\varvec{\theta }}\left\{ s(\mathbf {y})\right\} E_{\varvec{\theta }}\left\{ s(\mathbf {y})\right\} ^{t} \end{aligned}$$

the covariance matrix of \(s(\mathbf {y})\), a p.s.d. matrix, which establishes concavity.

The second derivatives of \(\mathcal {M}\left( \varvec{\theta },\nu \right) \) collected above also simplify:

$$\begin{aligned} -\frac{1}{n}\frac{\partial ^{2}}{\partial \varvec{\theta }^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right)= & {} \exp \left\{ \nu -\nu ^{\star }\left( \varvec{\theta }\right) \right\} \\&\times \int s(\mathbf {y})s(\mathbf {y})^{t}\exp \left( s(\mathbf {y})^{t}\varvec{\theta }-\phi \left( \varvec{\theta }\right) \right) \\ -\frac{1}{n}\frac{\partial ^{2}}{\partial \nu \partial \varvec{\theta }}\mathcal {M}\left( \varvec{\theta },\nu \right)= & {} \exp \left\{ \nu -\nu ^{\star }\left( \varvec{\theta }\right) \right\} \\&\times \int s(\mathbf {y})\exp \left( s(\mathbf {y})^{t}\varvec{\theta }-\phi \left( \varvec{\theta }\right) \right) \\ -\frac{1}{n}\frac{\partial ^{2}}{\partial \nu {}^{2}}\mathcal {M}\left( \varvec{\theta },\nu \right)= & {} \exp \left\{ \nu -\nu ^{\star }\left( \varvec{\theta }\right) \right\} \end{aligned}$$

so that the full Hessian \(\mathbf {H}\) can be written in block form as follows:

$$\begin{aligned} -\frac{1}{n}\exp \left\{ \nu ^{\star }\left( \varvec{\theta }\right) -\nu \right\} \mathbf {H}=\left[ \begin{array}{cc} E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) s\left( \mathbf {y}\right) ^{t}\right\} &{} E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) \right\} \\ E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) ^{t}\right\} &{} 1 \end{array}\right] =\mathbf {A} \end{aligned}$$

and \(\mathbf {H}\) is negative definite if and only if, for all \(\mathbf {x},c\) such that \(({\mathbf {x}},c)\ne {\mathbf {0}}\):

$$\begin{aligned} \left[ \begin{array}{cc} \mathbf {x}^{t}&c\end{array}\right] \mathbf {A}\left[ \begin{array}{c} \mathbf {x}\\ c \end{array}\right] >0 \end{aligned}$$

which the following establishes:

$$\begin{aligned}&\left[ \begin{array}{cc} \mathbf {x}^{t}&c\end{array}\right] \left[ \begin{array}{cc} E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) s\left( \mathbf {y}\right) ^{t}\right\} &{} E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) \right\} \\ E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) ^{t}\right\} &{} 1 \end{array}\right] \left[ \begin{array}{c} \mathbf {x}\\ c \end{array}\right] \\&\quad = \left[ \begin{array}{cc} \mathbf {x}^{t}&c\end{array}\right] \left[ \begin{array}{c} E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) s\left( \mathbf {y}\right) ^{t}\right\} \mathbf {x}+cE_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) \right\} \\ E_{\varvec{\theta }}\left\{ s\left( \mathbf {y}\right) ^{t}\right\} \mathbf {x}+c \end{array}\right] \\&\quad = E_{\varvec{\theta }}\left\{ \mathbf {x}^{t}s\left( \mathbf {y}\right) s\left( \mathbf {y}\right) ^{t}\mathbf {x}\right\} +2E_{\varvec{\theta }}\left\{ \mathbf {x}^{t}s\left( \mathbf {y}\right) \right\} c+c^{2}\\&\quad = E_{\varvec{\theta }}\left[ \left( s\left( \mathbf {y}\right) ^{t}\mathbf {x}+c\right) ^{2}\right] >0 \end{aligned}$$

which holds provided that \(s\left( \mathbf {y}\right) ^{t}\mathbf {x}+c\) is not almost surely zero for any \((\mathbf {x},c)\ne {\mathbf {0}}\), i.e. provided the exponential family is minimal.
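The bordered matrix \(\mathbf {A}\) is easy to inspect numerically. A sketch for a toy two-dimensional sufficient statistic on \([0,1]\) (our choice for illustration, not from the paper):

```python
import numpy as np

# Toy natural exponential family on [0, 1] with s(y) = (y, y^2).
theta = np.array([0.5, -1.0])
grid = np.linspace(0.0, 1.0, 20001)
w = np.gradient(grid)                   # quadrature weights
S = np.stack([grid, grid**2])           # s(y) on the grid, shape (2, N)

dens = np.exp(theta @ S)
p = dens / (w * dens).sum()             # normalised density

E_ssT = (w * p * S[:, None, :] * S[None, :, :]).sum(axis=-1)   # E{ s s^t }
E_s   = (w * p * S).sum(axis=-1)                               # E{ s }

# Bordered matrix A from the appendix; -H is proportional to A.
A = np.block([[E_ssT,        E_s[:, None]],
              [E_s[None, :], np.ones((1, 1))]])

eigs = np.linalg.eigvalsh(A)
print(eigs)     # all strictly positive: M is concave in (theta, nu)
```

Strict positivity holds here because \(1\), \(y\) and \(y^{2}\) are linearly independent on \([0,1]\), i.e. the family is minimal.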

Noise-contrastive divergence approximates the Poisson transform (Theorem 7)

We have assumed that

$$\begin{aligned} f_{\varvec{\theta }}(\mathbf {y})-\log q(\mathbf {y})\le C(\varvec{\theta }) \end{aligned}$$

for a certain constant \(C(\varvec{\theta })\) that may depend on \(\varvec{\theta }\), and all \(\mathbf {y}\in \Omega \). We rewrite the log odds ratio as \(h(\mathbf {y})-\log (m)\) where

$$\begin{aligned} h(\mathbf {y}):=f_{\varvec{\theta }}(\mathbf {y})+\nu -\log q(\mathbf {y})+\log (n) \end{aligned}$$

does not depend on m; note \(h(\mathbf {y})\le \bar{h}:=C(\varvec{\theta })+\nu +\log (n)\). One has

$$\begin{aligned}&{\mathcal {R}}^{m}(\varvec{\theta },\nu )+\log (m/n)= \sum _{i=1}^{n}\log \left[ \frac{m\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} }{n\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} +mq(\mathbf {y}_{i})}\right] \\&\quad +\sum _{j=1}^{m}\log \left[ \frac{mq({\mathbf {r}}_{j})}{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} +mq({\mathbf {r}}_{j})}\right] \end{aligned}$$

where the first term trivially converges (as \(m\rightarrow +\infty \)) to

$$\begin{aligned} \sum _{i=1}^{n}\left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu -\log q(\mathbf {y}_{i})\right\} . \end{aligned}$$

Regarding the second term, one has

$$\begin{aligned}&\log \left[ \frac{mq({\mathbf {r}}_{j})}{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} +mq({\mathbf {r}}_{j})}\right] \\&\quad =\log \left[ 1-\frac{1}{1+m\exp \left\{ -h({\mathbf {r}}_{j})\right\} }\right] \end{aligned}$$

with


$$\begin{aligned} 0\le \frac{1}{1+m\exp \left\{ -h({\mathbf {r}}_{j})\right\} }\le \frac{1}{m}\exp (\bar{h}). \end{aligned}$$

Since \(\left| \log (1-x)+x\right| \le x^{2}\) for \(x\in [0,1/2]\), we have, for m large enough, that

$$\begin{aligned}&\left| \log \left[ \frac{mq({\mathbf {r}}_{j})}{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} +mq({\mathbf {r}}_{j})}\right] \right. \nonumber \\&\quad \left. +\frac{1}{1+m\exp \left\{ -h({\mathbf {r}}_{j})\right\} }\right| \le \frac{\exp (2\bar{h})}{m^{2}} \end{aligned}$$

and


$$\begin{aligned} \left| \frac{1}{1+m\exp \left\{ -h({\mathbf {r}}_{j})\right\} }-\frac{1}{m}\exp \left\{ h({\mathbf {r}}_{j})\right\} \right| \le \frac{\exp (2\bar{h})}{m^{2}} \end{aligned}$$

and since, by the law of large numbers,

$$\begin{aligned}&\frac{1}{m}\sum _{j=1}^{m}\exp \left\{ h({\mathbf {r}}_{j})\right\} \rightarrow \mathbb {E}_{q}[\exp \left\{ h({\mathbf {r}})\right\} ]\nonumber \\&\quad =n\int \exp \left\{ f_{\varvec{\theta }}(\mathbf {y})+\nu \right\} \, d\mathbf {y}<+\infty \end{aligned}$$

almost surely as \(m\rightarrow +\infty \), one also has

$$\begin{aligned}&\sum _{j=1}^{m}\log \left[ \frac{mq({\mathbf {r}}_{j})}{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} +mq({\mathbf {r}}_{j})}\right] \rightarrow \\&\quad -n\int \exp \left\{ f_{\varvec{\theta }}(\mathbf {y})+\nu \right\} \, d\mathbf {y}\end{aligned}$$

almost surely, since the difference between the two sums is bounded deterministically by \(\exp (2\bar{h})/m\).
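This convergence can be watched directly by fitting the logistic-regression form of the objective: classify data points against reference points drawn from q, with the log odds \(h(\mathbf {y})-\log (m)\) derived above entering as a fixed offset. The sketch below is a toy illustration (a Gaussian target with \(f_{\varvec{\theta }}(y)=\theta y-y^{2}/2\) and a Gaussian q, our choices rather than the paper's); it maximises the Bernoulli log-likelihood by Newton-Raphson and recovers \((\hat{\theta },{\hat{\nu }})\) close to the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalised model exp{ f_theta(y) + nu } with f_theta(y) = theta*y - y^2/2.
# Data are N(1, 1), so the truth is theta = 1, nu = -(1/2 + log sqrt(2*pi)).
n, m = 1000, 20000
y = rng.normal(1.0, 1.0, n)                      # observations
r = rng.normal(0.0, 2.0, m)                      # reference points from q = N(0, 4)

z = np.concatenate([y, r])                       # pooled points
t = np.concatenate([np.ones(n), np.zeros(m)])    # 1 = data, 0 = reference
log_q = -0.5 * z**2 / 4.0 - np.log(2.0 * np.sqrt(2.0 * np.pi))

# Logistic regression on (z, 1) with fixed offset log(n/m) - log q(z) - z^2/2,
# so the free coefficients are exactly (theta, nu).
X = np.column_stack([z, np.ones_like(z)])
offset = np.log(n / m) - log_q - 0.5 * z**2

beta = np.zeros(2)                               # (theta, nu)
for _ in range(25):                              # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
    grad = X.T @ (t - p)
    H = -(X.T * (p * (1 - p))) @ X
    beta = beta - np.linalg.solve(H, grad)

theta_hat, nu_hat = beta
print(theta_hat, nu_hat)   # close to 1 and -(0.5 + log sqrt(2*pi)) = -1.419...
```

Increasing m shrinks the gap between this objective and \(\mathcal {M}(\varvec{\theta },\nu )\), as the theorem states.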

Uniform convergence of the noise-contrastive divergence (Theorem 8)

We first prove two intermediate results.

Lemma 9

Assuming that \(\left| f_{\varvec{\theta }}(\mathbf {y})-\log q(\mathbf {y})\right| \le C\) for all \(\mathbf {y}\in \Omega \), then there exists a bounded interval I such that, for any \(\varvec{\theta }\), the maximum of both functions \(\nu \rightarrow \mathcal {M}(\varvec{\theta },\nu )\) and \(\nu \rightarrow {\mathcal {R}}^{m}(\varvec{\theta },\nu )\) is attained in I.


Let \(\varvec{\theta }\) be some fixed value. \(\mathcal {M}(\varvec{\theta },\nu )\) is maximised at \(\nu ^{\star }(\varvec{\theta })=-\log \int _{\Omega }\exp \left\{ f_{\varvec{\theta }}(\mathbf {y})\right\} \, d\mathbf {y}\in \left[ -C,C\right] \), since \(e^{-C}q\le \exp \left\{ f_{\varvec{\theta }}\right\} \le e^{C}q\). For \({\mathcal {R}}^{m}(\varvec{\theta },\nu )\), using again \(e^{-C}q\le \exp \left\{ f_{\varvec{\theta }}\right\} \le e^{C}q\), one sees that \(l(\nu )\le {\mathcal {R}}^{m}(\varvec{\theta },\nu )\le u(\nu )\), where l and u are functions of \(\nu \) that diverge to \(-\infty \) both as \(\nu \rightarrow +\infty \) and as \(\nu \rightarrow -\infty \); i.e.

$$\begin{aligned}&{\mathcal {R}}^{m}(\varvec{\theta },\nu )+\log (m/n)\le u(\nu ):= \sum _{i=1}^{n}\log \left[ \frac{m\exp \left\{ C+\nu \right\} }{n\exp (-C+\nu )+m}\right] \\&\quad +\sum _{j=1}^{m}\log \left[ \frac{m}{n\exp \left\{ -C+\nu \right\} +m}\right] \end{aligned}$$

and the lower bound \(l(\nu )\) has a similar expression. Thus, one may construct an interval J such that the maximum of the function \(\nu \rightarrow {\mathcal {R}}^{m}(\varvec{\theta },\nu )\) is attained in J for all \(\varvec{\theta }\) (e.g. take J such that, for \(\nu \in J^{c}\), \(u(\nu )\le M_{l}/2\) and \(l(\nu )\le M_{l}/2\), with \(M_{l}=\sup _{\nu }l\)). To conclude, take \(I=J\cup [-C,C]\).

We now establish uniform convergence, but, in light of the previous result, we restrict \(\nu \) to the interval I defined in Lemma 9.

Lemma 10

Under the assumptions that (i) \(\Theta \) is bounded, (ii) \(\left| f_{\varvec{\theta }}(\mathbf {y})-\log q(\mathbf {y})\right| \le C\) for all \(\mathbf {y}\in \Omega \), and (iii) \(\left| f_{\varvec{\theta }}(\mathbf {y})-f_{\varvec{\theta }'}(\mathbf {y})\right| \le \kappa (\mathbf {y})\left\| \varvec{\theta }-\varvec{\theta }'\right\| \) with \(\kappa \) such that \(\mathbb {E}_{q}[\kappa ]<\infty \), one has, for fixed \({\mathcal {S}}=\left\{ \mathbf {y}_{1},\ldots ,\mathbf {y}_{n}\right\} \):

$$\begin{aligned}&\sup _{\left( \varvec{\theta },\nu \right) \in \Theta \times I}\left| {\mathcal {R}}^{m}(\varvec{\theta },\nu )+\log (m/n)+\sum _{i=1}^{n}\log q({\mathbf {y}}_{i})\right. \nonumber \\&\left. \quad -\mathcal {M}(\varvec{\theta },\nu )\right| \rightarrow 0 \end{aligned}$$

almost surely, relative to the randomness induced by \({\mathcal {R}}=\left\{ {\mathbf {r}}_{1},\ldots ,{\mathbf {r}}_{m}\right\} .\)


Recall that the absolute difference above was bounded by the sum of three terms in the previous Appendix. The first term was

$$\begin{aligned}&\sum _{i=1}^{n}\left[ \log \left[ \frac{m\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} }{n\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} +mq(\mathbf {y}_{i})}\right] \right. \\&\quad \left. -\left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu -\log q(\mathbf {y}_{i})\right\} \right] \end{aligned}$$

which clearly converges deterministically to 0 as \(m\rightarrow +\infty .\) In addition, this convergence is uniform with respect to \(\left( \varvec{\theta },\nu \right) \in \Theta \times I\), since \(\left| \log x-\log y\right| \le c\left| x-y\right| \) for \(x,y\ge 1/c\), and here, by Assumption (ii),

$$\begin{aligned} x:= & {} \frac{m\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} }{n\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu \right\} +mq(\mathbf {y}_{i})}\ge \frac{m\exp \left\{ -C+\nu \right\} }{n\exp \left\{ C+\nu \right\} +m}\\\ge & {} \frac{1}{2}\exp \left\{ -C+\nu \right\} \end{aligned}$$

for \(m\ge n\exp \left\{ C+\nu \right\} \), and \(y=\exp \left\{ f_{\varvec{\theta }}(\mathbf {y}_{i})+\nu -\log q(\mathbf {y}_{i})\right\} \ge \exp \left\{ -C+\nu \right\} \), so both x and y are lower bounded since \(\nu \in I\). Similarly \((x-y)\) is bounded by \(C'/m\), where \(C'\) is some constant independent of \(\varvec{\theta }\).

The second term, see (5.1), was bounded by \(\exp (2\bar{h})/m^{2}\), where \(\bar{h}\), an upper bound of h, may now be replaced by a constant, since \(h(\mathbf {y}):=f_{\varvec{\theta }}(\mathbf {y})+\nu -\log q(\mathbf {y})+\log (n)\le C+\nu +\log (n)\) and \(\nu \in I\), again by Assumption (ii).

The third term is related to the law of large numbers (5.2) for the random variable \(H_{(\varvec{\theta },\nu )}({\mathbf {r}}_{j}):=\exp \left\{ h({\mathbf {r}}_{j})\right\} \), which depends implicitly on \(\left( \varvec{\theta },\nu \right) \):

$$\begin{aligned} H_{(\varvec{\theta },\nu )}({\mathbf {r}}_{j})=\frac{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}}_{j})+\nu \right\} }{q({\mathbf {r}}_{j})}. \end{aligned}$$

To obtain (almost surely) uniform convergence, we use the generalised version of the Glivenko-Cantelli theorem; e.g. Theorem 19.4 p. 270 in Van der Vaart (2000). From Example 19.7 of the same book, one sees that a sufficient condition in our case is that \(\Theta \) is bounded [Assumption (i)], and that

$$\begin{aligned} \left| H_{(\varvec{\theta },\nu )}({\mathbf {r}})-H_{(\varvec{\theta }',\nu ')}({\mathbf {r}})\right| \le \mu ({\mathbf {r}})\left\| \varvec{\xi }-\varvec{\xi }'\right\| \end{aligned}$$

for \(\varvec{\xi }=(\varvec{\theta },\nu )\), \(\varvec{\xi }'=(\varvec{\theta }',\nu ')\), and \(\mu \) a function such that \(\mathbb {E}_{q}[\mu ]<\infty \). But

$$\begin{aligned}&\left| H_{(\varvec{\theta },\nu )}({\mathbf {r}})-H_{(\varvec{\theta }',\nu ')}({\mathbf {r}})\right| =\frac{n\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}})+\nu \right\} }{q({\mathbf {r}})}\\&\quad \left| 1-\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}})+\nu -f_{\varvec{\theta }'}({\mathbf {r}})-\nu '\right\} \right| \\&\quad \le ne^{C+\nu }\left| 1-\exp \left\{ f_{\varvec{\theta }}({\mathbf {r}})+\nu -f_{\varvec{\theta }'}({\mathbf {r}})-\nu '\right\} \right| \\&\quad \le C'\left\{ \kappa ({\mathbf {r}})\left\| \varvec{\theta }-\varvec{\theta }'\right\| +\left| \nu -\nu '\right| \right\} \\&\quad \le C'\left\{ \kappa ({\mathbf {r}})+1\right\} \left\| \varvec{\xi }-\varvec{\xi }'\right\| \end{aligned}$$

by Assumptions (ii) and (iii), and for some constant \(C'\) independent of \(\varvec{\theta }\), since \(\left| 1-e^{x}\right| \le K\left| x\right| \) for x in a bounded set. One may conclude, since, by Assumption (iii), \(\mathbb {E}_{q}[\kappa ]<\infty \).

We are now able to prove Theorem 8. Again, let \(\varvec{\xi }=(\varvec{\theta },\nu )\), and rewrite any function of \((\varvec{\theta },\nu )\) as a function of \(\varvec{\xi }\), i.e. \(\mathcal {M}(\varvec{\xi })\), \({\mathcal {R}}^{m}(\varvec{\xi })\). By e.g. Theorem 5.7, p. 45 of Van der Vaart (2000), the uniform convergence (5.3) implies that the maximiser \(\hat{\varvec{\xi }}^{m}\) of \({\mathcal {R}}^{m}(\varvec{\theta },\nu )\) converges to the maximiser \(\hat{\varvec{\xi }}\) of \(\mathcal {M}(\varvec{\theta },\nu )\), provided that (a) the maximisation is with respect to \(\left( \varvec{\theta },\nu \right) \in \Theta \times I\); and (b) \(\sup _{d(\varvec{\xi },\hat{\varvec{\xi }})\ge \epsilon }\mathcal {M}(\varvec{\xi })<\mathcal {M}(\hat{\varvec{\xi }})\). However, by Lemma 9 one sees that in (a) the same estimators would be obtained by maximising instead with respect to \(\left( \varvec{\theta },\nu \right) \in \Theta \times \mathbb {R}\), and (b) is a direct consequence of Assumption (iv) of the theorem, if one takes for \(d(\varvec{\xi },\hat{\varvec{\xi }})\) the supremum norm of \(\varvec{\xi }-\hat{\varvec{\xi }}\).

Additional information on the application

In our application we fit a spatial Markov chain model using logistic regression. Since the procedure involves the generation of a random set of reference points, we incur some Monte Carlo error in the estimates. Estimating the magnitude of the Monte Carlo error is just a matter of running the procedure several times to look at variability in the estimates. We did so over five repetitions and report the results in Fig. 5. For each repetition we plot the estimated smooth effect of saccade angle \(r_{\mathrm{ang}}\), along with a 95% confidence band. Since smoothing splines are used, smoothing hyperparameters had to be inferred from the data (using REML, Wood 2011), and the reported confidence band is conditional on the estimated value of the smoothing hyperparameters. The fits and confidence bands are extremely stable over independent repetitions. The R command we used was

Fig. 5

Eye movement model: 5 independent replications of the estimates under different sets of random reference points. We show here the estimated effect of saccade angle with an associated 95% pointwise confidence interval. The 5 replicates are in different colours and overlap each other almost completely, showing that 20 reference points per true datapoint are more than enough to produce stable estimates. (Color figure online)





Cite this article

Barthelmé, S., Chopin, N. The Poisson transform for unnormalised statistical models. Stat Comput 25, 767–780 (2015).



  • Logistic Regression
  • Point Process
  • Normalisation Constant
  • Markov Chain Model
  • Bregman Divergence