Heterogeneous multi-task Gaussian Cox processes

Abstract

This paper presents a novel extension of multi-task Gaussian Cox processes for jointly modeling multiple heterogeneous correlated tasks, e.g., classification and regression, via multi-output Gaussian processes (MOGP). A MOGP prior over the parameters of the dedicated likelihoods for classification, regression and point process tasks facilitates sharing of information between heterogeneous tasks, while allowing for nonparametric parameter estimation. To circumvent non-conjugate Bayesian inference in the MOGP-modulated heterogeneous multi-task framework, we employ the data augmentation technique and derive a mean-field approximation that yields closed-form iterative updates for estimating model parameters. We demonstrate the performance and inference on both 1D synthetic data and 2D urban data from Vancouver.

Data Availability

Experiments were conducted on publicly available datasets.

Code Availability

The code is publicly available.

Notes

  1. The notion of data augmentation in statistics is different from that in deep learning.

  2. We focus on binary classification here; the extension to multi-class classification is discussed in Sect. D.

  3. For compactness of notation, the task index i is sometimes moved from subscript to superscript; this causes no confusion because i is used consistently.

  4. The income, education and non-market housing data is from the Vancouver Open Data Catalog (https://opendata.vancouver.ca/pages/home/). The crime data is from Kaggle (https://www.kaggle.com/datasets/wosaku/crime-in-vancouver).

References

  • Adams, R.P., Murray, I., MacKay, D.J. (2009). Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, pp 9–16.

  • Aglietti, V., Damoulas, T., Bonilla, E.V. (2019). Efficient inference in multi-task Cox process models. In: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, pp 537–546.

  • Alvarez, M.A., Lawrence, N.D. (2008). Sparse convolved Gaussian processes for multi-output regression. In: NIPS, pp 57–64.

  • Álvarez, M. A., Rosasco, L., & Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3), 195–266.

  • Álvarez, M.A., Ward, W., Guarnizo, C. (2019). Non-linear process convolutions for multi-output Gaussian processes. In: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, pp 1969–1977.

  • Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.

  • Besag, J. (1994). Discussion on the paper by Grenander and Miller. Journal of the Royal Statistical Society: Series B (Methodological), 56, 591–592.

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Cham: Springer.

  • Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.

  • Bonilla, E.V., Chai, K.M.A., Williams, C.K.I. (2007). Multi-task Gaussian process prediction. In: Platt JC, Koller D, Singer Y, et al (eds) Advances in neural information processing systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007. Curran Associates, Inc., pp 153–160.

  • Cunningham, J.P., Shenoy, K.V., Sahani, M. (2008). Fast Gaussian process methods for point process intensity estimation. In: International Conference on Machine Learning, ACM, pp 192–199.

  • Daley, D.J., Vere-Jones, D. (2003). An introduction to the theory of point processes. Vol. I. Probability and its applications.

  • Dezfouli, A., & Bonilla, E. V. (2015). Scalable inference for Gaussian process models with black-box likelihoods. Advances in Neural Information Processing Systems, 28, 1414–1422.

  • Diggle, P. J., Moraga, P., Rowlingson, B., et al. (2013). Spatial and spatio-temporal log-Gaussian Cox processes: Extending the geostatistical paradigm. Statistical Science, 28(4), 542–563.

  • Donner, C., & Opper, M. (2018). Efficient Bayesian inference of sigmoidal Gaussian Cox processes. Journal of Machine Learning Research, 19(1), 2710–2743.

  • Galy-Fajou, T., Wenzel, F., Donner, C., et al. (2020). Multi-class Gaussian process classification made conjugate: Efficient inference via data augmentation. In: Uncertainty in Artificial Intelligence, PMLR, pp 755–765.

  • Hensman, J., Matthews, A., Ghahramani, Z. (2015). Scalable variational Gaussian process classification. In: Artificial Intelligence and Statistics, PMLR, pp 351–360.

  • Hoffman, M.D., Blei, D.M., Wang, C., et al. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(5).

  • Jahani, S., Zhou, S., Veeramani, D., et al. (2021). Multioutput Gaussian process modulated Poisson processes for event prediction. IEEE Transactions on Reliability.

  • Journel, A. G., & Huijbregts, C. J. (1976). Mining geostatistics. London: Academic Press.

  • Lasko, T.A. (2014). Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data. In: Uncertainty in artificial intelligence: Proceedings of the Conference on Uncertainty in Artificial Intelligence, NIH Public Access, p 469.

  • Li, C., Zhu, J., Chen, J. (2014). Bayesian max-margin multi-task learning with data augmentation. In: International Conference on Machine Learning, PMLR, pp 415–423.

  • Lian, W., Henao, R., Rao, V., et al. (2015). A multitask point process predictive model. In: International Conference on Machine Learning, PMLR, pp 2030–2038.

  • Lloyd, C., Gunter, T., Osborne, M., et al. (2015). Variational inference for Gaussian process modulated Poisson processes. In: International Conference on Machine Learning, pp 1814–1822.

  • Møller, J., Syversveen, A. R., & Waagepetersen, R. P. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3), 451–482.

  • Moreno-Muñoz, P., Artés, A., Álvarez, M. (2018). Heterogeneous multi-output Gaussian process prediction. Advances in Neural Information Processing Systems, 31.

  • Mutny, M., Krause, A. (2021). No-regret algorithms for capturing events in Poisson point processes. In: International Conference on Machine Learning, PMLR, pp 7894–7904.

  • Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto, Toronto, ON, Canada.

  • Nguyen, T. V., & Bonilla, E. V. (2014). Automated variational inference for Gaussian process models. Advances in Neural Information Processing Systems, 27, 1404–1412.

  • Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American statistical Association, 108(504), 1339–1349.

  • Rasmussen, C.E. (2003). Gaussian processes in machine learning. In: Summer School on Machine Learning, Springer, pp 63–71.

  • Shirota, S., Gelfand, A.E. (2017). Space and circular time log Gaussian Cox processes with application to crime event data. The Annals of Applied Statistics, pp. 481–503.

  • Snell, J., Zemel, R.S. (2021). Bayesian few-shot classification with one-vs-each Pólya-Gamma augmented Gaussian processes. In: International Conference on Learning Representations, ICLR 2021. OpenReview.net.

  • Soleimani, H., Hensman, J., & Saria, S. (2017). Scalable joint models for reliable uncertainty-aware event prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8), 1948–1963.

  • Taylor, B. M., Davies, T. M., Rowlingson, B. S., et al. (2015). Bayesian inference and data augmentation schemes for spatial, spatiotemporal and multivariate log-Gaussian Cox processes in R. Journal of Statistical Software, 63(1), 1–48.

  • Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In: Artificial Intelligence and Statistics, pp 567–574.

  • Uspensky, J. V., et al. (1937). Introduction to mathematical probability. New York: McGraw-Hill Book Co., Inc.

  • Ver Hoef, J. M., & Barry, R. P. (1998). Constructing and fitting models for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference, 69(2), 275–294.

  • Wenzel, F., Galy-Fajou, T., Donner, C., et al. (2019). Efficient Gaussian process classification using Pólya-Gamma data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 5417–5424.

  • Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning (Vol. 2). MIT Press.

  • Wood, F., Meent, J.W., Mansinghka, V. (2014). A new approach to probabilistic programming inference. In: Artificial intelligence and statistics, PMLR, pp 1024–1032.

  • Zhou, F., Li, Z., Fan, X., et al. (2020). Efficient inference for nonparametric Hawkes processes using auxiliary latent variables. Journal of Machine Learning Research, 21(241), 1–31.

  • Zhou, F., Zhang, Y., Zhu, J. (2021). Efficient inference of flexible interaction in spiking-neuron networks. In: International Conference on Learning Representations.

  • Zhou, F., Kong, Q., Deng, Z., et al. (2022). Efficient inference for dynamic flexible interactions of neural populations. Journal of Machine Learning Research, 23(211), 1–49.

Funding

This work was conceived and conducted while FZ was a postdoctoral researcher at Tsinghua University. It was supported by NSFC Projects (Nos. 62061136001, 62106121, 61621136008, U19A2081), the MOE Project of Key Research Institute of Humanities and Social Sciences (22JJD110001), the National Key Research and Development Program of China (2021ZD0110502), the major key project of PCL (No. PCL2021A12), the fund for building world-class universities (disciplines) of Renmin University of China (No. KYGJC2023012), Tsinghua Guo Qiang Institute, the High Performance Computing Center of Tsinghua University, and the Public Computing Cloud of Renmin University of China.

Author information

Contributions

All authors have contributed significantly to this work. The theoretical derivation was primarily performed by FZ, and the experimental validation was primarily performed by FZ, QK and ZD with support from FH and PC. All authors wrote the manuscript. JZ supervised the whole project.

Corresponding author

Correspondence to Jun Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Willem Waegeman.

Appendices

Appendix A. Proof of augmented likelihood for classification

Substituting Eq. (2) into the classification likelihood Eq. (1b) in the main paper, we obtain

$$\begin{aligned} p(\textbf{y}^{c}\mid \{g_{i}^c\}_{i=1}^{I_c})=\prod _{i=1}^{I_c}\prod _{n=1}^{N^c_i}\int _0^\infty e^{h(\omega ^c_{i,n},y^c_{i,n}g^c_{i,n})}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,0)d\omega ^c_{i,n}, \end{aligned}$$
(A1)

where the integrand is the augmented likelihood:

$$\begin{aligned} p(\textbf{y}^c,\varvec{\omega }^c\mid \{g^c_i\}_{i=1}^{I_c})=\prod _{i=1}^{I_c}\prod _{n=1}^{N^c_i} e^{h(\omega ^c_{i,n},y^c_{i,n}g^c_{i,n})}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,0). \end{aligned}$$
(A2)

Appendix B. Proof of augmented likelihood for Cox process

Substituting Eqs. (2) and (4) into the product term and the exponential integral term, respectively, of the Cox process likelihood Eq. (1c) in the main paper, we obtain

$$\begin{aligned} \begin{aligned}&p(\textbf{x}^p\mid \{\bar{\lambda }_i,g^p_i\}_{i=1}^{I_p})=\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}\int _0^\infty \Lambda _{i}(\textbf{x}^p_{i,n},\omega ^p_{i,n})e^{h(\omega ^p_{i,n},g^p_{i,n})}d\omega ^p_{i,n}\\&\int _{\mathcal {X}}\int _0^\infty p_{\Lambda _i}(\Pi _i\mid \bar{\lambda }_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{h(\omega ,-g^p_i(\textbf{x}))}d\omega d\textbf{x}, \end{aligned} \end{aligned}$$
(B3)

where the integrand is the augmented likelihood:

$$\begin{aligned} \begin{aligned}&p(\textbf{x}^p,\varvec{\omega }^p,\Pi \mid \bar{\varvec{\lambda }}, \{g^p_i\}_{i=1}^{I_p})=\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}\Lambda _{i}(\textbf{x}^p_{i,n},\omega ^p_{i,n})e^{h(\omega ^p_{i,n},g^p_{i,n})}p_{\Lambda _i}(\Pi _i\mid \bar{\lambda }_i)\\&\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{h(\omega ,-g^p_i(\textbf{x}))}. \end{aligned} \end{aligned}$$
(B4)
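
For intuition, here is a minimal sketch of simulating the latent marked Poisson process \(\Pi _i\) appearing in Eq. (B4). Its intensity \(\tilde{\Lambda }_i(\textbf{x},\omega )=\bar{\lambda }_i p_{\text {PG}}(\omega \mid 1,0)\) factorizes, so a realization consists of a Poisson number of uniformly placed locations with i.i.d. PG(1, 0) marks; the 1D window and parameter values are illustrative assumptions.

```python
import numpy as np

def sample_pg_1_0(size, rng, trunc=200):
    # Truncated-series PG(1, 0) sampler (Polson et al., 2013), as in the previous sketch
    k = np.arange(1, trunc + 1)
    g = rng.exponential(size=(size, trunc))
    return (g / (k - 0.5) ** 2).sum(axis=1) / (2.0 * np.pi ** 2)

def sample_latent_marked_pp(lam_bar, window, rng):
    """One realization Pi = {(omega_j, x_j)} of the marked Poisson process with
    intensity Lambda(x, omega) = lam_bar * p_PG(omega | 1, 0) on a 1D window."""
    a, b = window
    n = rng.poisson(lam_bar * (b - a))      # total count ~ Poisson(lam_bar * |X|)
    x = rng.uniform(a, b, size=n)           # locations uniform on the window
    omega = sample_pg_1_0(n, rng)           # i.i.d. PG(1, 0) marks
    return omega, x

rng = np.random.default_rng(1)
omega, x = sample_latent_marked_pp(lam_bar=5.0, window=(0.0, 100.0), rng=rng)
print(len(x), omega[:3])
```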

Appendix C. Proof of mean-field approximation

The augmented joint distribution can be written as:

$$\begin{aligned} \begin{aligned}&p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})\\ =&\underbrace{p(\textbf{y}^{r}\mid \{g_{i}^r\}_{i=1}^{I_r})}_{\text {regression}}\underbrace{p(\textbf{y}^c,\varvec{\omega }^c\mid \{g^c_i\}_{i=1}^{I_c})}_{\text {augmented classification}}\underbrace{p(\textbf{x}^p,\varvec{\omega }^p,\Pi \mid \bar{\varvec{\lambda }}, \{g^p_i\}_{i=1}^{I_p})}_{\text {augmented Cox process}}\underbrace{p(g)}_{\text {MOGP}}p(\bar{\varvec{\lambda }})\\ =&\prod _{i=1}^{I_r}\prod _{n=1}^{N_{i}^r}\mathcal {N}(y^r_{i,n}\mid g^r_{i,n},\sigma _i^2)\prod _{i=1}^{I_c}\prod _{n=1}^{N^c_i}e^{h(\omega ^c_{i,n},y^c_{i,n}g^c_{i,n})}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,0)\\&\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}\Lambda _{i}(\textbf{x}^p_{i,n},\omega ^p_{i,n})e^{h(\omega ^p_{i,n},g^p_{i,n})}p_{\Lambda _i}(\Pi _i\mid \bar{\lambda }_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{h(\omega ,-g^p_i(\textbf{x}))}p(g)p(\bar{\varvec{\lambda }}). \end{aligned} \end{aligned}$$
(C5)

Here, we assume the variational posterior \(q(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})=q_1(\varvec{\omega }^c,\varvec{\omega }^p,\Pi )q_2(g,\bar{\varvec{\lambda }})\). To minimize the KL divergence between the variational posterior and the true posterior, it can be shown that the optimal distribution of each factor is proportional to the exponentiated expectation of the log joint distribution, taken over the variables in the other factor (Bishop, 2006):

$$\begin{aligned} \begin{aligned} q_1^*(\varvec{\omega }^c,\varvec{\omega }^p,\Pi )&\propto e^{{\mathbb {E}_{q_2}[\log p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})]}},\\ q_2^*(g,\bar{\varvec{\lambda }})&\propto e^{{\mathbb {E}_{q_1}[\log p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})]}}. \end{aligned} \end{aligned}$$
(C6)

Substituting Eq. (C5) into Eq. (C6), we can obtain the optimal variational distributions. The process of deriving variational posteriors for \(\varvec{\omega }^c\), \(\varvec{\omega }^p\), \(\Pi\), and \(\bar{\varvec{\lambda }}\) is similar to that in Donner and Opper (2018). The primary distinction lies in the treatment of the latent function g. Further details are provided below.

1.1 The optimal density for Pólya-Gamma latent variables

The optimal variational posteriors of \(\varvec{\omega }^c\) and \(\varvec{\omega }^p\) are

$$\begin{aligned} \begin{aligned} q_1(\varvec{\omega }^c)=\prod _{i=1}^{I_c}\prod _{n=1}^{N_i^c}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,\tilde{g}^c_{i,n}),\ \ \ \ q_1(\varvec{\omega }^p)=\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}p_{\text {PG}}(\omega ^p_{i,n}\mid 1,\tilde{g}^p_{i,n}), \end{aligned} \end{aligned}$$
(C7)

where \(\tilde{g}^\cdot _{i,n}=\sqrt{\mathbb {E}[{g^\cdot _{i,n}}^2]}\) and we adopt the tilted Pólya-Gamma distribution \(p_{\text {PG}}(\omega \mid b,c)\propto e^{-c^2\omega /2}p_{\text {PG}}(\omega \mid b,0)\) (Polson et al., 2013).
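Evaluating the mean-field updates only requires the first moment of the tilted Pólya-Gamma distribution, which is available in closed form: \(\mathbb {E}[\omega ]=\frac{b}{2c}\tanh (c/2)\) for \(\omega \sim p_{\text {PG}}(\omega \mid b,c)\), with limit \(b/4\) as \(c\rightarrow 0\) (Polson et al., 2013). A small helper, written as an illustrative sketch:

```python
import numpy as np

def pg_mean(b, c):
    """E[omega] for omega ~ PG(b, c): b/(2c) * tanh(c/2), with limit b/4 at c = 0."""
    c = np.asarray(c, dtype=float)
    small = np.abs(c) < 1e-8
    safe_c = np.where(small, 1.0, c)        # avoid 0/0; the limit is taken below
    return np.where(small, b / 4.0, b / (2.0 * safe_c) * np.tanh(safe_c / 2.0))

# In the mean-field updates, c is the current g-tilde = sqrt(E[g^2]) at each datum
g_tilde = np.sqrt(np.array([0.25, 1.0, 4.0]))   # toy values of E[g^2]
print(pg_mean(1.0, g_tilde))
```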

1.2 The optimal intensity for marked Poisson processes

The derivation of the optimal variational posterior of \(\Pi =\{\Pi _i\}_{i=1}^{I_p}\) is challenging, so we provide some details below. After taking the expectation, we obtain

$$\begin{aligned} \begin{aligned} q_1(\Pi _i) =&\frac{p_{\tilde{\Lambda }_i}(\Pi _i\mid \bar{\lambda }^1_i)\prod _{(\omega ,\textbf{x}) \in \Pi _i}e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2} \omega -\log 2}}{\iint p_{\tilde{\Lambda }_i}(\Pi _i\mid \bar{\lambda }^1_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2}d\omega d\textbf{x}}\\ =&p_{\tilde{\Lambda }_i}(\Pi _i\mid \bar{\lambda }^1_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2}\\&\exp {\left( \int _{\mathcal {X}}\int _0^\infty (1-e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2} -\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2})\bar{\lambda }^1_i p_{\text {PG}} (\omega \mid 1,0)d\omega d\textbf{x}\right) }\\ =&\prod _{(\omega ,\textbf{x})\in \Pi _i}\bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0) e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2}\\&\exp {\left( -\int _{\mathcal {X}}\int _0^\infty \bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0)e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2} \omega -\log 2}d\omega d\textbf{x}\right) }, \end{aligned} \end{aligned}$$
(C8)

where \(\bar{\lambda }^1_i=e^{\mathbb {E}[\log \bar{\lambda }_i]}\) and \(\tilde{\Lambda }_i(\textbf{x},\omega )=\bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0)\). The second line of Eq. (C8) used Campbell’s theorem \(\mathbb {E}_{\Pi _i}\left[ \exp {\left( \sum _{(\textbf{x},\omega )\in \Pi _i}h(\textbf{x},\omega )\right) }\right] =\exp {\left[ \iint \left( e^{ h(\textbf{x},\omega )}-1\right) \tilde{\Lambda }_i(\textbf{x},\omega )d\omega d\textbf{x}\right] }\). It is easy to see the posterior intensity of \(\Pi _i\) is

$$\begin{aligned} \begin{aligned} \Lambda _i^1(\textbf{x},\omega )&=\bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0)e^{-\frac{\mathbb {E}[g^p_i(\textbf{x})]}{2}-\frac{\mathbb {E}[{g^p_i(\textbf{x})}^2]}{2}\omega -\log 2}\\&=\bar{\lambda }_i^1 s(-\tilde{g}^p_i(\textbf{x}))p_{\text {PG}}(\omega \mid 1,\tilde{g}^p_i(\textbf{x}))e^{(\tilde{g}^p_i(\textbf{x})-\bar{g}^p_i(\textbf{x}))/2}, \end{aligned} \end{aligned}$$
(C9)

where we adopt \(e^{-c^2\omega /2}p_{\text {PG}}(\omega \mid b,0)=2\,s(-c)e^{c/2}p_{\text {PG}}(\omega \mid b,c)\) (Polson et al., 2013), \(\tilde{g}_i^p(\textbf{x})=\sqrt{\mathbb {E}[{g_i^p(\textbf{x})}^2]}\), \(\bar{g}_i^p(\textbf{x})=\mathbb {E}[g_i^p(\textbf{x})]\).
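In the updates that follow, it is mainly the \(\omega\)-marginal of Eq. (C9) that is needed. Because the tilted Pólya-Gamma density integrates to one, \(\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega =\bar{\lambda }_i^1 s(-\tilde{g}_i^p(\textbf{x}))e^{(\tilde{g}_i^p(\textbf{x})-\bar{g}_i^p(\textbf{x}))/2}\). A sketch on a toy grid, where the moments of \(g_i^p\) are placeholders rather than outputs of an actual \(q_2\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def marginal_posterior_rate(lam_bar_1, g_mean, g_sq_mean):
    """omega-marginal of Eq. (C9): the PG factor integrates to one, leaving
    lam_bar_1 * sigmoid(-g_tilde) * exp((g_tilde - g_bar) / 2) pointwise."""
    g_tilde = np.sqrt(g_sq_mean)
    return lam_bar_1 * sigmoid(-g_tilde) * np.exp((g_tilde - g_mean) / 2.0)

xs = np.linspace(0.0, 100.0, 200)           # evaluation grid on the domain X
g_mean = np.sin(xs / 10.0)                  # toy E[g_i^p(x)]
g_var = 0.1 * np.ones_like(xs)              # toy Var[g_i^p(x)]
rate = marginal_posterior_rate(2.0, g_mean, g_mean ** 2 + g_var)
print(rate[:5])
```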

1.3 The optimal density for intensity upper-bounds

The optimal variational posterior of \(\bar{\varvec{\lambda }}\) is

$$\begin{aligned} \begin{aligned} q_2(\bar{\varvec{\lambda }})=\prod _{i=1}^{I_p}p_{\text {Ga}}(\bar{\lambda }_i\mid N_i^p+R_i,|\mathcal {X}|), \end{aligned} \end{aligned}$$
(C10)

where \(p_{\text {Ga}}\) is the Gamma density, \(R_i=\int _\mathcal {X}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega d\textbf{x}\), and \(|\mathcal {X}|\) is the domain size.
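
A sketch of this update under the same toy setup as above; \(R_i\) is approximated by trapezoidal quadrature of the \(\omega\)-marginalized rate (here a stand-in array), and the digamma function gives \(\mathbb {E}[\log \bar{\lambda }_i]\), which is needed for \(\bar{\lambda }_i^1\) in the \(q_1\) updates:

```python
import numpy as np
from scipy.special import digamma

xs = np.linspace(0.0, 100.0, 200)                   # grid on the domain X
marginal_rate = 1.5 + 0.5 * np.sin(xs / 10.0)       # stand-in for the Eq. (C9) marginal
N_i, domain_size = 120, 100.0                       # event count N_i^p and |X|

R_i = np.trapz(marginal_rate, xs)                   # R_i by trapezoidal quadrature
shape, rate = N_i + R_i, domain_size                # Gamma(N_i^p + R_i, |X|) of Eq. (C10)
print(shape / rate)                                 # E[lambda_bar_i]
print(np.exp(digamma(shape) - np.log(rate)))        # lambda_bar_i^1 = exp(E[log lambda_bar_i])
```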

1.4 The optimal density for latent functions

The derivation of the optimal variational posterior of g is challenging, so we provide some details below. After taking the expectation, we obtain

$$\begin{aligned} \begin{aligned}&\log q_2(g)=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}\log \mathcal {N}(y_{i,n}^r\mid g_{i,n}^r,\sigma _i^2)+\sum _{i=1}^{I_c}\sum _{n=1}^{N_i^c}\left[ \frac{y_{i.n}^c g_{i,n}^c}{2}-\frac{{g_{i,n}^c}^2}{2}\mathbb {E}[\omega _{i,n}^c]\right] +\\&\sum _{i=1}^{I_p}\left[ \sum _{n=1}^{N_i^p}\frac{g_{i,n}^p}{2}-\frac{{g_{i,n}^p}^2}{2}\mathbb {E}[\omega _{i,n}^p]-\mathbb {E}_{\Pi _i}\sum _{(\omega ,\textbf{x})\in \Pi _i}\frac{g_i^p(\textbf{x})}{2}+\frac{{g_i^p(\textbf{x})}^2}{2}\omega \right] +\log p(g)+C\\&=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}\log \mathcal {N}(g_{i,n}^r\mid y_{i,n}^r,\sigma _i^2)+\sum _{i=1}^{I_c}\sum _{n=1}^{N_i^c}\log \mathcal {N}(g_{i,n}^c\mid \frac{y_{i,n}^c}{2\mathbb {E}[\omega _{i,n}^c]},\frac{1}{\mathbb {E}[\omega _{i,n}^c]})\\&+\sum _{i=1}^{I_p}\left[ \int _{\mathcal {X}}\sum _{n=1}^{N_i^p}\left( \frac{g_{i}^p(\textbf{x})}{2} -\frac{{g_{i}^p(\textbf{x})}^2}{2}\mathbb {E}[\omega _{i,n}^p]\right) \delta (\textbf{x}-\textbf{x}^p_{i,n}) d\textbf{x}\right. \\&\left. -\int _{\mathcal {X}}\int _0^\infty \left( \frac{g_i^p(\textbf{x})}{2}+\frac{{g_i^p(\textbf{x})}^2}{2} \omega \right) \Lambda _i^1(\textbf{x},\omega )d\omega d\textbf{x}\right] +\log p(g)+C\\&=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}\log \mathcal {N}(g_{i,n}^r\mid y_{i,n}^r,\sigma _i^2)+\sum _{i=1}^{I_c} \sum _{n=1}^{N_i^c}\log \mathcal {N}(g_{i,n}^c\mid \frac{y_{i,n}^c}{2\mathbb {E}[\omega _{i,n}^c]},\frac{1}{\mathbb {E}[\omega _{i,n}^c]})\\&+\sum _{i=1}^{I_p}\left[ \int _{\mathcal {X}}\left( \frac{1}{2}\sum _{n=1}^{N^p_i}\delta (\textbf{x} -\textbf{x}^p_{i,n})-\frac{1}{2}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega \right) g_i^p (\textbf{x})d\textbf{x}+\log p(g)\right. \\&\left. -\frac{1}{2}\int _{\mathcal {X}}\left( \sum _{n=1}^{N_i^p}\mathbb {E}[\omega _{i,n}^p]\delta (\textbf{x}-\textbf{x}^p_{i,n})+\int _0^\infty \omega \Lambda _i^1(\textbf{x},\omega )d\omega \right) {g_i^p(\textbf{x})}^2 d\textbf{x}\right] +C\\&=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}\log \mathcal {N}(g_{i,n}^r\mid y_{i,n}^r,\sigma _i^2) +\sum _{i=1}^{I_c}\sum _{n=1}^{N_i^c}\log \mathcal {N}(g_{i,n}^c\mid \frac{y_{i,n}^c}{2\mathbb {E} [\omega _{i,n}^c]},\frac{1}{\mathbb {E}[\omega _{i,n}^c]})\\&+\sum _{i=1}^{I_p}\left[ \int _{\mathcal {X}} B_i(\textbf{x})g_i^p(\textbf{x})d\textbf{x}-\frac{1}{2}\int _{\mathcal {X}} A_i(\textbf{x}){g_i^p(\textbf{x})}^2 d\textbf{x}\right] +\log p(g)+C, \end{aligned} \end{aligned}$$
(C11)

where \(A_i(\textbf{x})=\sum _{n=1}^{N^p_i}\mathbb {E}[\omega _{i,n}^p]\delta (\textbf{x}-\textbf{x}^p_{i,n})+\int _0^\infty \omega \Lambda _i^1(\textbf{x},\omega )d\omega\) and \(B_i(\textbf{x})=\frac{1}{2}\sum _{n=1}^{N^p_i}\delta (\textbf{x}-\textbf{x}^p_{i,n})-\frac{1}{2}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega\).

The computation of Eq. (C11) suffers from cubic complexity w.r.t. the number of data points in the regression, classification and point process tasks. We use the inducing-inputs formalism to make the inference scalable. We introduce M inducing inputs \([\textbf{x}_1,\ldots ,\textbf{x}_M]^\top\) on the domain \(\mathcal {X}\) for each task. The function values of the basis function \(f_q\) at these inducing inputs are denoted \(\textbf{f}_{q,\textbf{x}_m}\). Then the function values of the task-specific latent function \(g_i\) at these inducing inputs are \(\textbf{g}_{\textbf{x}_m}^i=\sum _{q=1}^Qw_{i,q}\textbf{f}_{q,\textbf{x}_m}\). If we define \(\textbf{g}_{\textbf{x}_m}=[\textbf{g}^{\top }_{1,\textbf{x}_m},\ldots ,\textbf{g}^{\top }_{I,\textbf{x}_m}]^\top\), then \(\textbf{g}_{\textbf{x}_m}\sim \mathcal {N}(\textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m})\), where \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}\) is the MOGP covariance on \(\textbf{x}_m\) for all tasks, and \(\textbf{g}_{\textbf{x}_m}^i\sim \mathcal {N}(\textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m}^i)\), where \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}^i\) is the i-th diagonal block of \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}\). Given \(\textbf{g}_{\textbf{x}_m}^i\), we assume the function \(g_i(\textbf{x})\) is the posterior mean function \(g_i(\textbf{x})=\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\textbf{g}_{\textbf{x}_m}^i\), where \(\textbf{k}_{\textbf{x}_m \textbf{x}}^{i}\) is the kernel between the inducing points and the prediction points for the i-th task. Therefore, \(\{g_{i,n}^r\}_{i=1}^{I_r}\), \(\{g_{i,n}^c\}_{i=1}^{I_c}\) and \(\{g_i^p(\textbf{x})\}_{i=1}^{I_p}\) can be written as

$$\begin{aligned} \begin{aligned} \textbf{g}_{i}^r=\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}\textbf{g}_{\textbf{x}_m}^{r,i},\textbf{g}_{i}^c=\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}\textbf{g}_{\textbf{x}_m}^{c,i}, g_i^p(\textbf{x})=\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\textbf{g}_{\textbf{x}_m}^{p,i}, \end{aligned} \end{aligned}$$
(C12)

where \(\textbf{g}_{i}^r=[g_{i,1}^r,\ldots ,g_{i,N_i^r}^r]^\top\), \(\textbf{g}_{i}^c=[g_{i,1}^c,\ldots ,g_{i,N_i^c}^c]^\top\), \(g_i^p(\textbf{x})\) is the function value of \(g_i^p\) on \(\textbf{x}\).
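A minimal numerical sketch of the interpolation in Eq. (C12), with a toy RBF kernel and made-up inducing inputs (both are assumptions for illustration, not the paper's settings):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Toy squared-exponential kernel between row-wise input sets A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(3)
Z = np.linspace(0, 10, 15)[:, None]         # M = 15 inducing inputs
X = np.linspace(0, 10, 200)[:, None]        # prediction grid

K_mm = rbf(Z, Z) + 1e-6 * np.eye(len(Z))    # K_{x_m x_m} (jitter for stability)
g_m = rng.multivariate_normal(np.zeros(len(Z)), K_mm)   # a draw of g at the inducing inputs
K_mx = rbf(Z, X)                            # k_{x_m x} cross-kernel, M x N

g_x = K_mx.T @ np.linalg.solve(K_mm, g_m)   # Eq. (C12): k^T K^{-1} g_m at every x
print(g_x.shape)                            # (200,)
```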

Substituting Eq. (C12) into Eq. (C11), we obtain the inducing points version of Eq. (C11):

$$\begin{aligned} \begin{aligned}&q_2(\textbf{g}_{\textbf{x}_m})\propto \prod _{i=1}^{I_r}\mathcal {N}(\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}\textbf{g}_{\textbf{x}_m}^{r,i}\mid \textbf{y}_i^r,\text {diag}(\sigma _i^2))\\&\cdot \prod _{i=1}^{I_c}\mathcal {N}(\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}\textbf{g}_{\textbf{x}_m}^{c,i}\mid \frac{\textbf{y}_{i}^c}{2\mathbb {E}[\varvec{\omega }_{i}^c]},\text {diag}(\frac{1}{\mathbb {E}[\varvec{\omega }_{i}^c]}))\\&\cdot \prod _{i=1}^{I_p}\exp \left( \int _{\mathcal {X}} B_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top } d\textbf{x}\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\textbf{g}_{\textbf{x}_m}^{p,i}\right. \\&\left. -\frac{1}{2}\textbf{g}_{\textbf{x}_m}^{p,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\int _{\mathcal {X}} A_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i}\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top } d\textbf{x}\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\textbf{g}_{\textbf{x}_m}^{p,i}\right) \\&\cdot \mathcal {N}(\textbf{g}_{\textbf{x}_m}\mid \textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m}). \end{aligned} \end{aligned}$$
(C13)

It is easy to see the third line of Eq. (C13) is a multivariate Gaussian distribution of \(\textbf{g}_{\textbf{x}_m}^{p,i}\). The likelihoods of \(\textbf{g}_{\textbf{x}_m}^{r,i}\) for regression, \(\textbf{g}_{\textbf{x}_m}^{c,i}\) for classification and \(\textbf{g}_{\textbf{x}_m}^{p,i}\) for point process tasks are all Gaussian distributions, so they are conjugate to the MOGP prior and we can obtain the closed-form variational posterior for \(\textbf{g}_{\textbf{x}_m}\):

$$\begin{aligned} \begin{aligned} q_2(\textbf{g}_{\textbf{x}_m})=\mathcal {N}(\textbf{g}_{\textbf{x}_m}\mid \textbf{m}_{\textbf{x}_m}, \mathbf {\Sigma }_{\textbf{x}_m}), \end{aligned} \end{aligned}$$
(C14)

where \(\textbf{g}_{\textbf{x}_m}=[\textbf{g}_{\textbf{x}_m}^{r\top },\textbf{g}_{\textbf{x}_m}^{c\top }, \textbf{g}_{\textbf{x}_m}^{p\top }]^\top\), \(\textbf{g}_{\textbf{x}_m}^{\cdot }=[\textbf{g}_{1,\textbf{x}_m}^{\cdot \top },\ldots , \textbf{g}_{I_\cdot ,\textbf{x}_m}^{\cdot \top }]^\top\) and

$$\begin{aligned} \mathbf {\Sigma }_{\textbf{x}_m}=\left[ \text {diag}\left( \textbf{H}_{\textbf{x}_m}^r, \textbf{H}_{\textbf{x}_m}^c,\textbf{H}_{\textbf{x}_m}^p\right) +\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{-1}\right] ^{-1},\textbf{m}_{\textbf{x}_m}=\mathbf {\Sigma }_{\textbf{x}_m} [\textbf{v}_{\textbf{x}_m}^{r\top },\textbf{v}_{\textbf{x}_m}^{c\top },\textbf{v}_{\textbf{x}_m}^{p\top }]^\top , \end{aligned}$$

where \(\textbf{H}_{\textbf{x}_m}^\cdot =\text {diag}(\textbf{H}_{1,\textbf{x}_m}^\cdot ,\ldots , \textbf{H}_{I_\cdot ,\textbf{x}_m}^\cdot )\), \(\textbf{v}_{\textbf{x}_m}^\cdot =[\textbf{v}_{1,\textbf{x}_m}^{\cdot \top },\ldots , \textbf{v}_{I_\cdot ,\textbf{x}_m}^{\cdot \top }]^\top\) and

$$\begin{aligned} \begin{aligned} \textbf{H}_{i,\textbf{x}_m}^r=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i}\textbf{D}^r_i\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}, \textbf{v}_{i,\textbf{x}_m}^{r}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}} \textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i}\frac{\textbf{y}^r_i}{\sigma _i^2},\\ \textbf{H}_{i,\textbf{x}_m}^c=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i}\textbf{D}^c_i\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}, \textbf{v}_{i,\textbf{x}_m}^{c}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}} \textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i}\frac{\textbf{y}^c_i}{2},\\ \textbf{H}_{i,\textbf{x}_m}^p=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}} \int _{\mathcal {X}}A_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i} \textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top }d\textbf{x}\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}},\\ \textbf{v}_{i,\textbf{x}_m}^{p}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\int _{\mathcal {X}}B_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i}d\textbf{x}, \end{aligned} \end{aligned}$$

where \(\textbf{D}^r_i=\text {diag}(1/\sigma _i^2)\) and \(\textbf{D}^c_i=\text {diag}(\mathbb {E}[\varvec{\omega }^{c}_i])\).
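
To make the alternating structure concrete, below is a self-contained toy instance of these closed-form iterations for a single binary classification task without inducing points, i.e., only the Pólya-Gamma update of Eq. (C7) and the classification block of Eq. (C14). The kernel, data and hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(0, 10, 60)[:, None]
y = np.where(np.sin(X[:, 0]) > 0, 1.0, -1.0)        # labels in {-1, +1}

def rbf(A, B, ls=1.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

K = rbf(X, X) + 1e-6 * np.eye(len(X))               # GP prior covariance (with jitter)
K_inv = np.linalg.inv(K)
m, S = np.zeros(len(X)), K.copy()                   # initialize q(g) at the prior

for _ in range(30):
    # q1 update (Eq. C7): omega_n ~ PG(1, g_tilde_n) with g_tilde_n = sqrt(E[g_n^2])
    g_tilde = np.sqrt(m ** 2 + np.diag(S))
    E_omega = np.tanh(g_tilde / 2.0) / (2.0 * g_tilde)
    # q2 update (classification block of Eq. C14): Gaussian with
    # precision K^{-1} + diag(E[omega]) and linear term y / 2
    S = np.linalg.inv(K_inv + np.diag(E_omega))
    m = S @ (y / 2.0)

pred = 1.0 / (1.0 + np.exp(-m))                     # approximate p(y = 1 | x)
print(np.mean((pred > 0.5) == (y > 0)))             # training accuracy, close to 1
```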

Appendix D. Multi-class classification

In the paper, we mainly focus on the binary classification problem because each binary classification task corresponds to a single latent function. This setting is consistent with the regression and point process tasks in which each task only specifies a single latent function.

For a Z-class classification problem, each task corresponds to Z latent functions. The usual likelihood for multi-class classification is the softmax function:

$$\begin{aligned} \begin{aligned} p(y_{i,n}^c=k\mid \textbf{f}_{i,n}^c)=\frac{e^{(f_{i,n}^{c,k})}}{\sum _{z=1}^Ze^{(f_{i,n}^{c,z})}}, \end{aligned} \end{aligned}$$
(D15)

where \(f_{i,n}^{c,k}=f_{i}^{c,k}(\textbf{x}_n)\), \(\textbf{f}_{i,n}^c=[f_{i,n}^{c,1},\ldots ,f_{i,n}^{c,Z}]^{\top }\), \(k\in \{1,\ldots ,Z\}\). However, the Pólya-Gamma augmentation technique for binary classification cannot be directly applied to the softmax function. Galy-Fajou et al. (2020) and Snell and Zemel (2021) proposed the logistic-softmax function and the one-vs-each softmax approximation, respectively, which enable Pólya-Gamma augmentation to yield a conditionally conjugate model for multi-class classification tasks. Both methods can be incorporated into our framework in the multi-class scenario. We refer readers to Galy-Fajou et al. (2020) and Snell and Zemel (2021) for more details.
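
For concreteness, here is a toy comparison of the softmax in Eq. (D15) with the logistic-softmax of Galy-Fajou et al. (2020), which replaces each exponential by a logistic function so that Pólya-Gamma augmentation applies class-wise (a sketch of the link functions only, not of the full augmented model):

```python
import numpy as np

def softmax(f):
    # Eq. (D15): p(y = k | f) = exp(f_k) / sum_z exp(f_z)
    e = np.exp(f - f.max())                 # max-shift for numerical stability
    return e / e.sum()

def logistic_softmax(f):
    # Logistic-softmax (Galy-Fajou et al., 2020): sigma(f_k) / sum_z sigma(f_z)
    s = 1.0 / (1.0 + np.exp(-f))
    return s / s.sum()

f = np.array([2.0, 0.5, -1.0])              # latent values for Z = 3 classes at one input
print(softmax(f))                           # approx [0.79, 0.18, 0.04]
print(logistic_softmax(f))
```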

Appendix E. Comparison with HetMOGP

An anonymous reviewer pointed out that an important baseline to compare against is Moreno-Muñoz et al. (2018), which can also handle regression, classification and counting data, even though it uses a discretized Poisson likelihood instead of the continuous point process likelihood considered in this work. Moreno-Muñoz et al. (2018) used the generic variational inference method mentioned in the introduction for the parameter posterior, so this comparison can demonstrate the advantage of using data augmentation for conjugate operations.

We compare TLL and RT for HMGCP and the heterogeneous multi-output Gaussian process (HetMOGP) (Moreno-Muñoz et al., 2018) on the synthetic data from Sects. 5.1 and 5.2. Since HetMOGP can only handle discrete count data, we discretize the original observation window [0, 100] into 100 bins and count the number of points in each bin. We use the default hyperparameter settings in the demo code provided by Moreno-Muñoz et al. (2018). The results are shown in Tables 4 and 5: HMGCP achieves a better TLL than HetMOGP trained on the discretized count data. For a fair comparison of efficiency, we run both HMGCP and HetMOGP on all tasks, and our inference is much faster than that of HetMOGP. This is because HetMOGP relies on generic variational inference, so numerical optimization must be performed within the variational iterations, whereas the variational iterations of HMGCP have completely analytical expressions due to data augmentation, which leads to more efficient computation. Note that the running times in Tables 4 and 5 cover all tasks (regression, classification, and Cox processes) and are therefore longer than those reported in Sects. 5.1 and 5.2, which are based solely on the Cox process tasks.

Table 4 TLL and RT (in seconds) for HMGCP and HetMOGP on the three synthetic datasets of Sect. 5.1
Table 5 TLL and RT for HMGCP and HetMOGP on the synthetic datasets of Sect. 5.2 over ten random configurations of missing gaps with three different missing-gap widths (0 means complete data)

About this article

Cite this article

Zhou, F., Kong, Q., Deng, Z. et al. Heterogeneous multi-task Gaussian Cox processes. Mach Learn 112, 5105–5134 (2023). https://doi.org/10.1007/s10994-023-06382-1
