Heterogeneous multi-task Gaussian Cox processes

Abstract

This paper presents a novel extension of multi-task Gaussian Cox processes for jointly modeling multiple heterogeneous correlated tasks, e.g., classification and regression, via multi-output Gaussian processes (MOGP). A MOGP prior over the parameters of the dedicated likelihoods for classification, regression and point process tasks facilitates sharing of information between heterogeneous tasks, while allowing for nonparametric parameter estimation. To circumvent non-conjugate Bayesian inference in the MOGP-modulated heterogeneous multi-task framework, we employ the data augmentation technique and derive a mean-field approximation that yields closed-form iterative updates for estimating model parameters. We demonstrate the performance and inference on both 1D synthetic data and 2D urban data from Vancouver.

Data Availability

Experiments were conducted on publicly available datasets.

Code Availability

The code is publicly available.

Notes

  1. The notion of data augmentation in statistics is different from that in deep learning.

  2. We focus on binary classification here; the extension to multi-class classification is discussed in Sect. D.

  3. For compactness of notation, the task index i is sometimes moved from subscript to superscript; this causes no confusion because i is used consistently.

  4. The income, education and non-market housing data is from the Vancouver Open Data Catalog (https://opendata.vancouver.ca/pages/home/). The crime data is from Kaggle (https://www.kaggle.com/datasets/wosaku/crime-in-vancouver).

References

  • Adams, R.P., Murray, I., MacKay, D.J. (2009). Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, pp 9–16.

  • Aglietti, V., Damoulas, T., Bonilla, E.V. (2019). Efficient inference in multi-task Cox process models. In: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, pp 537–546.

  • Alvarez, M.A., Lawrence, N.D. (2008). Sparse convolved Gaussian processes for multi-output regression. In: NIPS, pp 57–64.

  • Álvarez, M. A., Rosasco, L., & Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3), 195–266.

  • Álvarez, M.A., Ward, W., Guarnizo, C. (2019). Non-linear process convolutions for multi-output Gaussian processes. In: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, pp 1969–1977.

  • Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.

  • Besag, J. (1994). Discussion on the paper by Grenander and Miller. Journal of the Royal Statistical Society: Series B (Methodological), 56, 591–592.

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Cham: Springer.

  • Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.

  • Bonilla, E.V., Chai, K.M.A., Williams, C.K.I. (2007). Multi-task Gaussian process prediction. In: Platt JC, Koller D, Singer Y, et al (eds) Advances in neural information processing systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007. Curran Associates, Inc., pp 153–160.

  • Cunningham, J.P., Shenoy, K.V., Sahani, M. (2008). Fast Gaussian process methods for point process intensity estimation. In: International Conference on Machine Learning, ACM, pp 192–199.

  • Daley, D.J., Vere-Jones, D. (2003). An introduction to the theory of point processes. Vol. I. Probability and its applications.

  • Dezfouli, A., & Bonilla, E. V. (2015). Scalable inference for Gaussian process models with black-box likelihoods. Advances in Neural Information Processing Systems, 28, 1414–1422.

  • Diggle, P. J., Moraga, P., Rowlingson, B., et al. (2013). Spatial and spatio-temporal log-Gaussian Cox processes: Extending the geostatistical paradigm. Statistical Science, 28(4), 542–563.

  • Donner, C., & Opper, M. (2018). Efficient Bayesian inference of sigmoidal Gaussian Cox processes. Journal of Machine Learning Research, 19(1), 2710–2743.

  • Galy-Fajou, T., Wenzel, F., Donner, C., et al. (2020). Multi-class Gaussian process classification made conjugate: Efficient inference via data augmentation. In: Uncertainty in Artificial Intelligence, PMLR, pp 755–765.

  • Hensman, J., Matthews, A., Ghahramani, Z. (2015). Scalable variational Gaussian process classification. In: Artificial Intelligence and Statistics, PMLR, pp 351–360.

  • Hoffman, M.D., Blei, D.M., Wang, C., et al. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(5).

  • Jahani, S., Zhou, S., Veeramani, D., et al. (2021). Multioutput Gaussian process modulated Poisson processes for event prediction. IEEE Transactions on Reliability.

  • Journel, A. G., & Huijbregts, C. J. (1976). Mining geostatistics. London: Academic Press.

  • Lasko, T.A. (2014). Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data. In: Uncertainty in artificial intelligence: Proceedings of the Conference on Uncertainty in Artificial Intelligence, NIH Public Access, p 469.

  • Li, C., Zhu, J., Chen, J. (2014). Bayesian max-margin multi-task learning with data augmentation. In: International Conference on Machine Learning, PMLR, pp 415–423.

  • Lian, W., Henao, R., Rao, V., et al. (2015). A multitask point process predictive model. In: International Conference on Machine Learning, PMLR, pp 2030–2038.

  • Lloyd, C., Gunter, T., Osborne, M., et al. (2015). Variational inference for Gaussian process modulated Poisson processes. In: International Conference on Machine Learning, pp 1814–1822.

  • Møller, J., Syversveen, A. R., & Waagepetersen, R. P. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3), 451–482.

  • Moreno-Muñoz, P., Artés, A., Álvarez, M. (2018). Heterogeneous multi-output Gaussian process prediction. Advances in Neural Information Processing Systems, 31.

  • Mutny, M., Krause, A. (2021). No-regret algorithms for capturing events in Poisson point processes. In: International Conference on Machine Learning, PMLR, pp 7894–7904.

  • Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto, Toronto, ON, Canada.

  • Nguyen, T. V., & Bonilla, E. V. (2014). Automated variational inference for Gaussian process models. Advances in Neural Information Processing Systems, 27, 1404–1412.

  • Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American statistical Association, 108(504), 1339–1349.

  • Rasmussen, C.E. (2003). Gaussian processes in machine learning. In: Summer School on Machine Learning, Springer, pp 63–71.

  • Shirota, S., Gelfand, A.E. (2017). Space and circular time log Gaussian Cox processes with application to crime event data. The Annals of Applied Statistics, pp. 481–503.

  • Snell, J., Zemel, R.S. (2021). Bayesian few-shot classification with one-vs-each Pólya-Gamma augmented Gaussian processes. In: International Conference on Learning Representations, ICLR 2021. OpenReview.net.

  • Soleimani, H., Hensman, J., & Saria, S. (2017). Scalable joint models for reliable uncertainty-aware event prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8), 1948–1963.

  • Taylor, B. M., Davies, T. M., Rowlingson, B. S., et al. (2015). Bayesian inference and data augmentation schemes for spatial, spatiotemporal and multivariate log-Gaussian Cox processes in R. Journal of Statistical Software, 63(1), 1–48.

  • Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In: Artificial Intelligence and Statistics, pp 567–574.

  • Uspensky, J. V., et al. (1937). Introduction to mathematical probability. New York: McGraw-Hill Book Co., Inc.

  • Ver Hoef, J. M., & Barry, R. P. (1998). Constructing and fitting models for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference, 69(2), 275–294.

  • Wenzel, F., Galy-Fajou, T., Donner, C., et al. (2019). Efficient Gaussian process classification using Pólya-Gamma data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 5417–5424.

  • Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning (Vol. 2). MIT Press.

  • Wood, F., Meent, J.W., Mansinghka, V. (2014). A new approach to probabilistic programming inference. In: Artificial intelligence and statistics, PMLR, pp 1024–1032.

  • Zhou, F., Li, Z., Fan, X., et al. (2020). Efficient inference for nonparametric Hawkes processes using auxiliary latent variables. Journal of Machine Learning Research, 21(241), 1–31.

  • Zhou, F., Zhang, Y., Zhu, J. (2021). Efficient inference of flexible interaction in spiking-neuron networks. In: International Conference on Learning Representations.

  • Zhou, F., Kong, Q., Deng, Z., et al. (2022). Efficient inference for dynamic flexible interactions of neural populations. Journal of Machine Learning Research, 23(211), 1–49.

Funding

This work was conceived and conducted while FZ was a postdoctoral researcher at Tsinghua University. It was supported by NSFC Projects (Nos. 62061136001, 62106121, 61621136008, U19A2081), the MOE Project of Key Research Institute of Humanities and Social Sciences (22JJD110001), the National Key Research and Development Program of China (2021ZD0110502), the major key project of PCL (No. PCL2021A12), the fund for building world-class universities (disciplines) of Renmin University of China (No. KYGJC2023012), Tsinghua Guo Qiang Institute, the High Performance Computing Center of Tsinghua University, and the Public Computing Cloud of Renmin University of China.

Author information

Contributions

All authors have contributed significantly to this work. The theoretical derivation was primarily performed by FZ, and the experimental validation was primarily performed by FZ, QK and ZD with support from FH and PC. All authors wrote the manuscript. JZ supervised the whole project.

Corresponding author

Correspondence to Jun Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Willem Waegeman.

Appendices

Appendix A. Proof of augmented likelihood for classification

Substituting Eq. (2) into the classification likelihood Eq. (1b) in the main paper, we obtain

$$\begin{aligned} p(\textbf{y}^{c}\mid \{g_{i}^c\}_{i=1}^{I_c})=\prod _{i=1}^{I_c}\prod _{n=1}^{N^c_i}\int _0^\infty e^{h(\omega ^c_{i,n},y^c_{i,n}g^c_{i,n})}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,0)d\omega ^c_{i,n}, \end{aligned}$$
(A1)

where the integrand is the augmented likelihood:

$$\begin{aligned} p(\textbf{y}^c,\varvec{\omega }^c\mid \{g^c_i\}_{i=1}^{I_c})=\prod _{i=1}^{I_c}\prod _{n=1}^{N^c_i} e^{h(\omega ^c_{i,n},y^c_{i,n}g^c_{i,n})}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,0). \end{aligned}$$
(A2)

Appendix B. Proof of augmented likelihood for Cox process

Substituting Eqs. (2) and (4) into the product term and the exponential integral term, respectively, of the Cox process likelihood Eq. (1c) in the main paper, we obtain

$$\begin{aligned} \begin{aligned}&p(\textbf{x}^p\mid \{\bar{\lambda }_i,g^p_i\}_{i=1}^{I_p})=\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}\int _0^\infty \Lambda _{i}(\textbf{x}^p_{i,n},\omega ^p_{i,n})e^{h(\omega ^p_{i,n},g^p_{i,n})}d\omega ^p_{i,n}\\&\int _{\mathcal {X}}\int _0^\infty p_{\Lambda _i}(\Pi _i\mid \bar{\lambda }_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{h(\omega ,-g^p_i(\textbf{x}))}d\omega d\textbf{x}, \end{aligned} \end{aligned}$$
(B3)

where the integrand is the augmented likelihood:

$$\begin{aligned} \begin{aligned}&p(\textbf{x}^p,\varvec{\omega }^p,\Pi \mid \bar{\varvec{\lambda }}, \{g^p_i\}_{i=1}^{I_p})=\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}\Lambda _{i}(\textbf{x}^p_{i,n},\omega ^p_{i,n})e^{h(\omega ^p_{i,n},g^p_{i,n})}p_{\Lambda _i}(\Pi _i\mid \bar{\lambda }_i)\\&\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{h(\omega ,-g^p_i(\textbf{x}))}. \end{aligned} \end{aligned}$$
(B4)
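
For intuition, here is a minimal sketch of simulating the latent marked Poisson process \(\Pi _i\) appearing in Eq. (B4). Its intensity \(\tilde{\Lambda }_i(\textbf{x},\omega )=\bar{\lambda }_i p_{\text {PG}}(\omega \mid 1,0)\) factorizes, so a realization consists of a Poisson number of uniformly placed locations with i.i.d. PG(1, 0) marks; the 1D window and parameter values are illustrative assumptions.

```python
import numpy as np

def sample_pg_1_0(size, rng, trunc=200):
    # Truncated-series PG(1, 0) sampler (Polson et al., 2013), as in the previous sketch
    k = np.arange(1, trunc + 1)
    g = rng.exponential(size=(size, trunc))
    return (g / (k - 0.5) ** 2).sum(axis=1) / (2.0 * np.pi ** 2)

def sample_latent_marked_pp(lam_bar, window, rng):
    """One realization Pi = {(omega_j, x_j)} of the marked Poisson process with
    intensity Lambda(x, omega) = lam_bar * p_PG(omega | 1, 0) on a 1D window."""
    a, b = window
    n = rng.poisson(lam_bar * (b - a))      # total count ~ Poisson(lam_bar * |X|)
    x = rng.uniform(a, b, size=n)           # locations uniform on the window
    omega = sample_pg_1_0(n, rng)           # i.i.d. PG(1, 0) marks
    return omega, x

rng = np.random.default_rng(1)
omega, x = sample_latent_marked_pp(lam_bar=5.0, window=(0.0, 100.0), rng=rng)
print(len(x), omega[:3])
```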

Appendix C. Proof of mean-field approximation

The augmented joint distribution can be written as:

$$\begin{aligned} \begin{aligned}&p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})\\ =&\underbrace{p(\textbf{y}^{r}\mid \{g_{i}^r\}_{i=1}^{I_r})}_{\text {regression}}\underbrace{p(\textbf{y}^c,\varvec{\omega }^c\mid \{g^c_i\}_{i=1}^{I_c})}_{\text {augmented classification}}\underbrace{p(\textbf{x}^p,\varvec{\omega }^p,\Pi \mid \bar{\varvec{\lambda }}, \{g^p_i\}_{i=1}^{I_p})}_{\text {augmented Cox process}}\underbrace{p(g)}_{\text {MOGP}}p(\bar{\varvec{\lambda }})\\ =&\prod _{i=1}^{I_r}\prod _{n=1}^{N_{i}^r}\mathcal {N}(y^r_{i,n}\mid g^r_{i,n},\sigma _i^2)\prod _{i=1}^{I_c}\prod _{n=1}^{N^c_i}e^{h(\omega ^c_{i,n},y^c_{i,n}g^c_{i,n})}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,0)\\&\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}\Lambda _{i}(\textbf{x}^p_{i,n},\omega ^p_{i,n})e^{h(\omega ^p_{i,n},g^p_{i,n})}p_{\Lambda _i}(\Pi _i\mid \bar{\lambda }_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{h(\omega ,-g^p_i(\textbf{x}))}p(g)p(\bar{\varvec{\lambda }}). \end{aligned} \end{aligned}$$
(C5)

Here, we assume the variational posterior \(q(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})=q_1(\varvec{\omega }^c,\varvec{\omega }^p,\Pi )q_2(g,\bar{\varvec{\lambda }})\). To minimize the KL divergence between the variational posterior and the true posterior, it can be shown that the optimal distribution of each factor is proportional to the exponentiated expectation of the log joint distribution, taken over the variables in the other factor (Bishop, 2006):

$$\begin{aligned} \begin{aligned} q_1^*(\varvec{\omega }^c,\varvec{\omega }^p,\Pi )&\propto e^{{\mathbb {E}_{q_2}[\log p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})]}},\\ q_2^*(g,\bar{\varvec{\lambda }})&\propto e^{{\mathbb {E}_{q_1}[\log p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})]}}. \end{aligned} \end{aligned}$$
(C6)

Substituting Eq. (C5) into Eq. (C6), we can obtain the optimal variational distributions. The process of deriving variational posteriors for \(\varvec{\omega }^c\), \(\varvec{\omega }^p\), \(\Pi\), and \(\bar{\varvec{\lambda }}\) is similar to that in Donner and Opper (2018). The primary distinction lies in the treatment of the latent function g. Further details are provided below.

1.1 The optimal density for Pólya-Gamma latent variables

The optimal variational posteriors of \(\varvec{\omega }^c\) and \(\varvec{\omega }^p\) are

$$\begin{aligned} \begin{aligned} q_1(\varvec{\omega }^c)=\prod _{i=1}^{I_c}\prod _{n=1}^{N_i^c}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,\tilde{g}^c_{i,n}),\ \ \ \ q_1(\varvec{\omega }^p)=\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}p_{\text {PG}}(\omega ^p_{i,n}\mid 1,\tilde{g}^p_{i,n}), \end{aligned} \end{aligned}$$
(C7)

where \(\tilde{g}^\cdot _{i,n}=\sqrt{\mathbb {E}[{g^\cdot _{i,n}}^2]}\) and we adopt the tilted Pólya-Gamma distribution \(p_{\text {PG}}(\omega \mid b,c)\propto e^{-c^2\omega /2}p_{\text {PG}}(\omega \mid b,0)\) (Polson et al., 2013).
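Evaluating the mean-field updates only requires the first moment of the tilted Pólya-Gamma distribution, which is available in closed form: \(\mathbb {E}[\omega ]=\frac{b}{2c}\tanh (c/2)\) for \(\omega \sim p_{\text {PG}}(\omega \mid b,c)\), with limit \(b/4\) as \(c\rightarrow 0\) (Polson et al., 2013). A small helper, written as an illustrative sketch:

```python
import numpy as np

def pg_mean(b, c):
    """E[omega] for omega ~ PG(b, c): b/(2c) * tanh(c/2), with limit b/4 at c = 0."""
    c = np.asarray(c, dtype=float)
    small = np.abs(c) < 1e-8
    safe_c = np.where(small, 1.0, c)        # avoid 0/0; the limit is taken below
    return np.where(small, b / 4.0, b / (2.0 * safe_c) * np.tanh(safe_c / 2.0))

# In the mean-field updates, c is the current g-tilde = sqrt(E[g^2]) at each datum
g_tilde = np.sqrt(np.array([0.25, 1.0, 4.0]))   # toy values of E[g^2]
print(pg_mean(1.0, g_tilde))
```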

1.2 The optimal intensity for marked Poisson processes

The derivation of the optimal variational posterior of \(\Pi =\{\Pi _i\}_{i=1}^{I_p}\) is challenging, so we provide some details below. After taking the expectation, we obtain

$$\begin{aligned} \begin{aligned} q_1(\Pi _i) =&\frac{p_{\tilde{\Lambda }_i}(\Pi _i\mid \bar{\lambda }^1_i)\prod _{(\omega ,\textbf{x}) \in \Pi _i}e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2} \omega -\log 2}}{\iint p_{\tilde{\Lambda }_i}(\Pi _i\mid \bar{\lambda }^1_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2}d\omega d\textbf{x}}\\ =&p_{\tilde{\Lambda }_i}(\Pi _i\mid \bar{\lambda }^1_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2}\\&\exp {\left( \int _{\mathcal {X}}\int _0^\infty (1-e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2} -\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2})\bar{\lambda }^1_i p_{\text {PG}} (\omega \mid 1,0)d\omega d\textbf{x}\right) }\\ =&\prod _{(\omega ,\textbf{x})\in \Pi _i}\bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0) e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2}\\&\exp {\left( -\int _{\mathcal {X}}\int _0^\infty \bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0)e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2} \omega -\log 2}d\omega d\textbf{x}\right) }, \end{aligned} \end{aligned}$$
(C8)

where \(\bar{\lambda }^1_i=e^{\mathbb {E}[\log \bar{\lambda }_i]}\) and \(\tilde{\Lambda }_i(\textbf{x},\omega )=\bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0)\). The second line of Eq. (C8) used Campbell’s theorem \(\mathbb {E}_{\Pi _i}\left[ \exp {\left( \sum _{(\textbf{x},\omega )\in \Pi _i}h(\textbf{x},\omega )\right) }\right] =\exp {\left[ \iint \left( e^{ h(\textbf{x},\omega )}-1\right) \tilde{\Lambda }_i(\textbf{x},\omega )d\omega d\textbf{x}\right] }\). It is easy to see the posterior intensity of \(\Pi _i\) is

$$\begin{aligned} \begin{aligned} \Lambda _i^1(\textbf{x},\omega )&=\bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0)e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2}\\&=\bar{\lambda }_i^1 s(-\tilde{g}^p_i(\textbf{x}))p_{\text {PG}}(\omega \mid 1,\tilde{g}^p_i(\textbf{x}))e^{(\tilde{g}^p_i(\textbf{x})-\bar{g}^p_i(\textbf{x}))/2}, \end{aligned} \end{aligned}$$
(C9)

where we adopt \(e^{-c^2\omega /2}p_{\text {PG}}(\omega \mid b,0)=2\,s(-c)e^{c/2}p_{\text {PG}}(\omega \mid b,c)\) (Polson et al., 2013), \(\tilde{g}_i^p(\textbf{x})=\sqrt{\mathbb {E}[{g_i^p(\textbf{x})}^2]}\), \(\bar{g}_i^p(\textbf{x})=\mathbb {E}[g_i^p(\textbf{x})]\).
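In the updates that follow, it is mainly the \(\omega\)-marginal of Eq. (C9) that is needed. Because the tilted Pólya-Gamma density integrates to one, \(\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega =\bar{\lambda }_i^1 s(-\tilde{g}_i^p(\textbf{x}))e^{(\tilde{g}_i^p(\textbf{x})-\bar{g}_i^p(\textbf{x}))/2}\). A sketch on a toy grid, where the moments of \(g_i^p\) are placeholders rather than outputs of an actual \(q_2\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def marginal_posterior_rate(lam_bar_1, g_mean, g_sq_mean):
    """omega-marginal of Eq. (C9): the PG factor integrates to one, leaving
    lam_bar_1 * sigmoid(-g_tilde) * exp((g_tilde - g_bar) / 2) pointwise."""
    g_tilde = np.sqrt(g_sq_mean)
    return lam_bar_1 * sigmoid(-g_tilde) * np.exp((g_tilde - g_mean) / 2.0)

xs = np.linspace(0.0, 100.0, 200)           # evaluation grid on the domain X
g_mean = np.sin(xs / 10.0)                  # toy E[g_i^p(x)]
g_var = 0.1 * np.ones_like(xs)              # toy Var[g_i^p(x)]
rate = marginal_posterior_rate(2.0, g_mean, g_mean ** 2 + g_var)
print(rate[:5])
```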

1.3 The optimal density for intensity upper-bounds

The optimal variational posterior of \(\bar{\varvec{\lambda }}\) is

$$\begin{aligned} \begin{aligned} q_2(\bar{\varvec{\lambda }})=\prod _{i=1}^{I_p}p_{\text {Ga}}(\bar{\lambda }_i\mid N_i^p+R_i,|\mathcal {X}|), \end{aligned} \end{aligned}$$
(C10)

where \(p_{\text {Ga}}\) is the Gamma density, \(R_i=\int _\mathcal {X}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega d\textbf{x}\), and \(|\mathcal {X}|\) is the domain size.
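
A sketch of this update under the same toy setup as above; \(R_i\) is approximated by trapezoidal quadrature of the \(\omega\)-marginalized rate (here a stand-in array), and the digamma function gives \(\mathbb {E}[\log \bar{\lambda }_i]\), which is needed for \(\bar{\lambda }_i^1\) in the \(q_1\) updates:

```python
import numpy as np
from scipy.special import digamma

xs = np.linspace(0.0, 100.0, 200)                   # grid on the domain X
marginal_rate = 1.5 + 0.5 * np.sin(xs / 10.0)       # stand-in for the Eq. (C9) marginal
N_i, domain_size = 120, 100.0                       # event count N_i^p and |X|

R_i = np.trapz(marginal_rate, xs)                   # R_i by trapezoidal quadrature
shape, rate = N_i + R_i, domain_size                # Gamma(N_i^p + R_i, |X|) of Eq. (C10)
print(shape / rate)                                 # E[lambda_bar_i]
print(np.exp(digamma(shape) - np.log(rate)))        # lambda_bar_i^1 = exp(E[log lambda_bar_i])
```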

1.4 The optimal density for latent functions

The derivation of the optimal variational posterior of g is challenging, so we provide some details below. After taking the expectation, we obtain

$$\begin{aligned} \begin{aligned}&\log q_2(g)=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}\log \mathcal {N}(y_{i,n}^r\mid g_{i,n}^r,\sigma _i^2)+\sum _{i=1}^{I_c}\sum _{n=1}^{N_i^c}\left[ \frac{y_{i.n}^c g_{i,n}^c}{2}-\frac{{g_{i,n}^c}^2}{2}\mathbb {E}[\omega _{i,n}^c]\right] +\\&\sum _{i=1}^{I_p}\left[ \sum _{n=1}^{N_i^p}\frac{g_{i,n}^p}{2}-\frac{{g_{i,n}^p}^2}{2}\mathbb {E}[\omega _{i,n}^p]-\mathbb {E}_{\Pi _i}\sum _{(\omega ,\textbf{x})\in \Pi _i}\frac{g_i^p(\textbf{x})}{2}+\frac{{g_i^p(\textbf{x})}^2}{2}\omega \right] +\log p(g)+C\\&=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}\log \mathcal {N}(g_{i,n}^r\mid y_{i,n}^r,\sigma _i^2)+\sum _{i=1}^{I_c}\sum _{n=1}^{N_i^c}\log \mathcal {N}(g_{i,n}^c\mid \frac{y_{i,n}^c}{2\mathbb {E}[\omega _{i,n}^c]},\frac{1}{\mathbb {E}[\omega _{i,n}^c]})\\&+\sum _{i=1}^{I_p}\left[ \int _{\mathcal {X}}\sum _{n=1}^{N_i^p}\left( \frac{g_{i}^p(\textbf{x})}{2} -\frac{{g_{i}^p(\textbf{x})}^2}{2}\mathbb {E}[\omega _{i,n}^p]\right) \delta (\textbf{x}-\textbf{x}^p_{i,n}) d\textbf{x}\right. \\&\left. -\int _{\mathcal {X}}\int _0^\infty \left( \frac{g_i^p(\textbf{x})}{2}+\frac{{g_i^p(\textbf{x})}^2}{2} \omega \right) \Lambda _i^1(\textbf{x},\omega )d\omega d\textbf{x}\right] +\log p(g)+C\\&=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}\log \mathcal {N}(g_{i,n}^r\mid y_{i,n}^r,\sigma _i^2)+\sum _{i=1}^{I_c} \sum _{n=1}^{N_i^c}\log \mathcal {N}(g_{i,n}^c\mid \frac{y_{i,n}^c}{2\mathbb {E}[\omega _{i,n}^c]},\frac{1}{\mathbb {E}[\omega _{i,n}^c]})\\&+\sum _{i=1}^{I_p}\left[ \int _{\mathcal {X}}\left( \frac{1}{2}\sum _{n=1}^{N^p_i}\delta (\textbf{x} -\textbf{x}^p_{i,n})-\frac{1}{2}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega \right) g_i^p (\textbf{x})d\textbf{x}+\log p(g)\right. \\&\left. -\frac{1}{2}\int _{\mathcal {X}}\left( \sum _{n=1}^{N_i^p}\mathbb {E}[\omega _{i,n}^p]\delta (\textbf{x}-\textbf{x}^p_{i,n})+\int _0^\infty \omega \Lambda _i^1(\textbf{x},\omega )d\omega \right) {g_i^p(\textbf{x})}^2 d\textbf{x}\right] +C\\&=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}\log \mathcal {N}(g_{i,n}^r\mid y_{i,n}^r,\sigma _i^2) +\sum _{i=1}^{I_c}\sum _{n=1}^{N_i^c}\log \mathcal {N}(g_{i,n}^c\mid \frac{y_{i,n}^c}{2\mathbb {E} [\omega _{i,n}^c]},\frac{1}{\mathbb {E}[\omega _{i,n}^c]})\\&+\sum _{i=1}^{I_p}\left[ \int _{\mathcal {X}} B_i(\textbf{x})g_i^p(\textbf{x})d\textbf{x}-\frac{1}{2}\int _{\mathcal {X}} A_i(\textbf{x}){g_i^p(\textbf{x})}^2 d\textbf{x}\right] +\log p(g)+C, \end{aligned} \end{aligned}$$
(C11)

where \(A_i(\textbf{x})=\sum _{n=1}^{N^p_i}\mathbb {E}[\omega _{i,n}^p]\delta (\textbf{x}-\textbf{x}^p_{i,n})+\int _0^\infty \omega \Lambda _i^1(\textbf{x},\omega )d\omega\) and \(B_i(\textbf{x})=\frac{1}{2}\sum _{n=1}^{N^p_i}\delta (\textbf{x}-\textbf{x}^p_{i,n})-\frac{1}{2}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega\).

The computation of Eq. (C11) suffers from cubic complexity w.r.t. the number of data points in the regression, classification and point process tasks. We use the inducing-inputs formalism to make the inference scalable. We introduce M inducing inputs \([\textbf{x}_1,\ldots ,\textbf{x}_M]^\top\) on the domain \(\mathcal {X}\) for each task. The function values of the basis function \(f_q\) at these inducing inputs are denoted \(\textbf{f}_{q,\textbf{x}_m}\). Then the function values of the task-specific latent function \(g_i\) at these inducing inputs are \(\textbf{g}_{\textbf{x}_m}^i=\sum _{q=1}^Qw_{i,q}\textbf{f}_{q,\textbf{x}_m}\). If we define \(\textbf{g}_{\textbf{x}_m}=[\textbf{g}^{\top }_{1,\textbf{x}_m},\ldots ,\textbf{g}^{\top }_{I,\textbf{x}_m}]^\top\), then \(\textbf{g}_{\textbf{x}_m}\sim \mathcal {N}(\textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m})\), where \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}\) is the MOGP covariance on \(\textbf{x}_m\) for all tasks, and \(\textbf{g}_{\textbf{x}_m}^i\sim \mathcal {N}(\textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m}^i)\), where \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}^i\) is the i-th diagonal block of \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}\). Given \(\textbf{g}_{\textbf{x}_m}^i\), we assume the function \(g_i(\textbf{x})\) is the posterior mean function \(g_i(\textbf{x})=\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\textbf{g}_{\textbf{x}_m}^i\), where \(\textbf{k}_{\textbf{x}_m \textbf{x}}^{i}\) is the kernel between the inducing points and the prediction points for the i-th task. Therefore, \(\{g_{i,n}^r\}_{i=1}^{I_r}\), \(\{g_{i,n}^c\}_{i=1}^{I_c}\) and \(\{g_i^p(\textbf{x})\}_{i=1}^{I_p}\) can be written as

$$\begin{aligned} \begin{aligned} \textbf{g}_{i}^r=\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}\textbf{g}_{\textbf{x}_m}^{r,i},\textbf{g}_{i}^c=\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}\textbf{g}_{\textbf{x}_m}^{c,i}, g_i^p(\textbf{x})=\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\textbf{g}_{\textbf{x}_m}^{p,i}, \end{aligned} \end{aligned}$$
(C12)

where \(\textbf{g}_{i}^r=[g_{i,1}^r,\ldots ,g_{i,N_i^r}^r]^\top\), \(\textbf{g}_{i}^c=[g_{i,1}^c,\ldots ,g_{i,N_i^c}^c]^\top\), \(g_i^p(\textbf{x})\) is the function value of \(g_i^p\) on \(\textbf{x}\).
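A minimal numerical sketch of the interpolation in Eq. (C12), with a toy RBF kernel and made-up inducing inputs (both are assumptions for illustration, not the paper's settings):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Toy squared-exponential kernel between row-wise input sets A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(3)
Z = np.linspace(0, 10, 15)[:, None]         # M = 15 inducing inputs
X = np.linspace(0, 10, 200)[:, None]        # prediction grid

K_mm = rbf(Z, Z) + 1e-6 * np.eye(len(Z))    # K_{x_m x_m} (jitter for stability)
g_m = rng.multivariate_normal(np.zeros(len(Z)), K_mm)   # a draw of g at the inducing inputs
K_mx = rbf(Z, X)                            # k_{x_m x} cross-kernel, M x N

g_x = K_mx.T @ np.linalg.solve(K_mm, g_m)   # Eq. (C12): k^T K^{-1} g_m at every x
print(g_x.shape)                            # (200,)
```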

Substituting Eq. (C12) into Eq. (C11), we obtain the inducing points version of Eq. (C11):

$$\begin{aligned} \begin{aligned}&q_2(\textbf{g}_{\textbf{x}_m})\propto \prod _{i=1}^{I_r}\mathcal {N}(\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}\textbf{g}_{\textbf{x}_m}^{r,i}\mid \textbf{y}_i^r,\text {diag}(\sigma _i^2))\\&\cdot \prod _{i=1}^{I_c}\mathcal {N}(\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}\textbf{g}_{\textbf{x}_m}^{c,i}\mid \frac{\textbf{y}_{i}^c}{2\mathbb {E}[\varvec{\omega }_{i}^c]},\text {diag}(\frac{1}{\mathbb {E}[\varvec{\omega }_{i}^c]}))\\&\cdot \prod _{i=1}^{I_p}\exp \left( \int _{\mathcal {X}} B_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top } d\textbf{x}\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\textbf{g}_{\textbf{x}_m}^{p,i}\right. \\&\left. -\frac{1}{2}\textbf{g}_{\textbf{x}_m}^{p,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\int _{\mathcal {X}} A_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i}\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top } d\textbf{x}\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\textbf{g}_{\textbf{x}_m}^{p,i}\right) \\&\cdot \mathcal {N}(\textbf{g}_{\textbf{x}_m}\mid \textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m}). \end{aligned} \end{aligned}$$
(C13)

It is easy to see the third line of Eq. (C13) is a multivariate Gaussian distribution of \(\textbf{g}_{\textbf{x}_m}^{p,i}\). The likelihoods of \(\textbf{g}_{\textbf{x}_m}^{r,i}\) for regression, \(\textbf{g}_{\textbf{x}_m}^{c,i}\) for classification and \(\textbf{g}_{\textbf{x}_m}^{p,i}\) for point process tasks are all Gaussian distributions, so they are conjugate to the MOGP prior and we can obtain the closed-form variational posterior for \(\textbf{g}_{\textbf{x}_m}\):

$$\begin{aligned} \begin{aligned} q_2(\textbf{g}_{\textbf{x}_m})=\mathcal {N}(\textbf{g}_{\textbf{x}_m}\mid \textbf{m}_{\textbf{x}_m}, \mathbf {\Sigma }_{\textbf{x}_m}), \end{aligned} \end{aligned}$$
(C14)

where \(\textbf{g}_{\textbf{x}_m}=[\textbf{g}_{\textbf{x}_m}^{r\top },\textbf{g}_{\textbf{x}_m}^{c\top }, \textbf{g}_{\textbf{x}_m}^{p\top }]^\top\), \(\textbf{g}_{\textbf{x}_m}^{\cdot }=[\textbf{g}_{1,\textbf{x}_m}^{\cdot \top },\ldots , \textbf{g}_{I_\cdot ,\textbf{x}_m}^{\cdot \top }]^\top\) and

$$\begin{aligned} \mathbf {\Sigma }_{\textbf{x}_m}=\left[ \text {diag}\left( \textbf{H}_{\textbf{x}_m}^r, \textbf{H}_{\textbf{x}_m}^c,\textbf{H}_{\textbf{x}_m}^p\right) +\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{-1}\right] ^{-1},\textbf{m}_{\textbf{x}_m}=\mathbf {\Sigma }_{\textbf{x}_m} [\textbf{v}_{\textbf{x}_m}^{r\top },\textbf{v}_{\textbf{x}_m}^{c\top },\textbf{v}_{\textbf{x}_m}^{p\top }]^\top , \end{aligned}$$

where \(\textbf{H}_{\textbf{x}_m}^\cdot =\text {diag}(\textbf{H}_{1,\textbf{x}_m}^\cdot ,\ldots , \textbf{H}_{I_\cdot ,\textbf{x}_m}^\cdot )\), \(\textbf{v}_{\textbf{x}_m}^\cdot =[\textbf{v}_{1,\textbf{x}_m}^{\cdot \top },\ldots , \textbf{v}_{I_\cdot ,\textbf{x}_m}^{\cdot \top }]^\top\) and

$$\begin{aligned} \begin{aligned} \textbf{H}_{i,\textbf{x}_m}^r=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i}\textbf{D}^r_i\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}, \textbf{v}_{i,\textbf{x}_m}^{r}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}} \textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i}\frac{\textbf{y}^r_i}{\sigma _i^2},\\ \textbf{H}_{i,\textbf{x}_m}^c=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i}\textbf{D}^c_i\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}, \textbf{v}_{i,\textbf{x}_m}^{c}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}} \textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i}\frac{\textbf{y}^c_i}{2},\\ \textbf{H}_{i,\textbf{x}_m}^p=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}} \int _{\mathcal {X}}A_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i} \textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top }d\textbf{x}\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}},\\ \textbf{v}_{i,\textbf{x}_m}^{p}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\int _{\mathcal {X}}B_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i}d\textbf{x}, \end{aligned} \end{aligned}$$

where \(\textbf{D}^r_i=\text {diag}(1/\sigma _i^2)\) and \(\textbf{D}^c_i=\text {diag}(\mathbb {E}[\varvec{\omega }^{c}_i])\).
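
To make the alternating structure concrete, below is a self-contained toy instance of these closed-form iterations for a single binary classification task without inducing points, i.e., only the Pólya-Gamma update of Eq. (C7) and the classification block of Eq. (C14). The kernel, data and hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(0, 10, 60)[:, None]
y = np.where(np.sin(X[:, 0]) > 0, 1.0, -1.0)        # labels in {-1, +1}

def rbf(A, B, ls=1.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

K = rbf(X, X) + 1e-6 * np.eye(len(X))               # GP prior covariance (with jitter)
K_inv = np.linalg.inv(K)
m, S = np.zeros(len(X)), K.copy()                   # initialize q(g) at the prior

for _ in range(30):
    # q1 update (Eq. C7): omega_n ~ PG(1, g_tilde_n) with g_tilde_n = sqrt(E[g_n^2])
    g_tilde = np.sqrt(m ** 2 + np.diag(S))
    E_omega = np.tanh(g_tilde / 2.0) / (2.0 * g_tilde)
    # q2 update (classification block of Eq. C14): Gaussian with
    # precision K^{-1} + diag(E[omega]) and linear term y / 2
    S = np.linalg.inv(K_inv + np.diag(E_omega))
    m = S @ (y / 2.0)

pred = 1.0 / (1.0 + np.exp(-m))                     # approximate p(y = 1 | x)
print(np.mean((pred > 0.5) == (y > 0)))             # training accuracy, close to 1
```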

Appendix D. Multi-class classification

In the paper, we mainly focus on the binary classification problem because each binary classification task corresponds to a single latent function. This setting is consistent with the regression and point process tasks in which each task only specifies a single latent function.

For a Z-class classification problem, each task corresponds to Z latent functions. The usual likelihood for multi-class classification is the softmax function:

$$\begin{aligned} \begin{aligned} p(y_{i,n}^c=k\mid \textbf{f}_{i,n}^c)=\frac{e^{(f_{i,n}^{c,k})}}{\sum _{z=1}^Ze^{(f_{i,n}^{c,z})}}, \end{aligned} \end{aligned}$$
(D15)

where \(f_{i,n}^{c,k}=f_{i}^{c,k}(\textbf{x}_n)\), \(\textbf{f}_{i,n}^c=[f_{i,n}^{c,1},\ldots ,f_{i,n}^{c,Z}]^{\top }\), \(k\in \{1,\ldots ,Z\}\). However, the Pólya-Gamma augmentation technique for binary classification cannot be directly applied to the softmax function. Galy-Fajou et al. (2020) and Snell and Zemel (2021) proposed the logistic-softmax function and the one-vs-each softmax approximation, respectively, which enable Pólya-Gamma augmentation to yield a conditionally conjugate model for multi-class classification tasks. Both methods can be incorporated into our framework in the multi-class scenario. We refer readers to Galy-Fajou et al. (2020) and Snell and Zemel (2021) for more details.
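
For concreteness, here is a toy comparison of the softmax in Eq. (D15) with the logistic-softmax of Galy-Fajou et al. (2020), which replaces each exponential by a logistic function so that Pólya-Gamma augmentation applies class-wise (a sketch of the link functions only, not of the full augmented model):

```python
import numpy as np

def softmax(f):
    # Eq. (D15): p(y = k | f) = exp(f_k) / sum_z exp(f_z)
    e = np.exp(f - f.max())                 # max-shift for numerical stability
    return e / e.sum()

def logistic_softmax(f):
    # Logistic-softmax (Galy-Fajou et al., 2020): sigma(f_k) / sum_z sigma(f_z)
    s = 1.0 / (1.0 + np.exp(-f))
    return s / s.sum()

f = np.array([2.0, 0.5, -1.0])              # latent values for Z = 3 classes at one input
print(softmax(f))                           # approx [0.79, 0.18, 0.04]
print(logistic_softmax(f))
```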

Appendix E. Comparison with HetMOGP

An anonymous reviewer pointed out that an important baseline to compare against is Moreno-Muñoz et al. (2018), which can also handle regression, classification and counting data, even though it uses a discretized Poisson likelihood instead of the continuous point process likelihood considered in this work. Moreno-Muñoz et al. (2018) used the generic variational inference method mentioned in the introduction for the parameter posterior, so this comparison can demonstrate the advantage of using data augmentation for conjugate operations.

We compare TLL and RT for HMGCP and the heterogeneous multi-output Gaussian process (HetMOGP) (Moreno-Muñoz et al., 2018) on the synthetic data from Sects. 5.1 and 5.2. Since HetMOGP can only handle discrete count data, we discretize the original observation window [0, 100] into 100 bins and count the number of points in each bin. We use the default hyperparameter settings in the demo code provided by Moreno-Muñoz et al. (2018). The results are shown in Tables 4 and 5: HMGCP achieves a better TLL than HetMOGP trained on the discretized count data. For a fair comparison of efficiency, we run both HMGCP and HetMOGP on all tasks, and our inference is much faster than that of HetMOGP. This is because HetMOGP relies on generic variational inference, so numerical optimization must be performed within the variational iterations, whereas the variational iterations of HMGCP have completely analytical expressions due to data augmentation, which leads to more efficient computation. Note that the running times in Tables 4 and 5 cover all tasks (regression, classification, and Cox processes) and are therefore longer than those reported in Sects. 5.1 and 5.2, which are based solely on the Cox process tasks.

Table 4 TLL and RT (in seconds) for HMGCP and HetMOGP on the three synthetic datasets of Sect. 5.1
Table 5 TLL and RT for HMGCP and HetMOGP on the synthetic datasets of Sect. 5.2 over ten random configurations of missing gaps with three different missing-gap widths (0 means complete data)

About this article

Cite this article

Zhou, F., Kong, Q., Deng, Z. et al. Heterogeneous multi-task Gaussian Cox processes. Mach Learn 112, 5105–5134 (2023). https://doi.org/10.1007/s10994-023-06382-1
