Abstract
This paper presents a novel extension of multi-task Gaussian Cox processes for jointly modeling multiple heterogeneous correlated tasks, e.g., classification and regression, via multi-output Gaussian processes (MOGP). A MOGP prior over the parameters of the dedicated likelihoods for classification, regression and point process tasks facilitates the sharing of information between heterogeneous tasks, while allowing for nonparametric parameter estimation. To circumvent the non-conjugate Bayesian inference in the MOGP-modulated heterogeneous multi-task framework, we employ the data augmentation technique and derive a mean-field approximation that yields closed-form iterative updates for estimating model parameters. We demonstrate the performance and inference on 1D synthetic data as well as 2D urban data from Vancouver.
Data Availability
Experiments were conducted on publicly available datasets.
Code Availability
The code is publicly available.
Notes
The notion of data augmentation in statistics is different from that in deep learning.
We focus on binary classification here. Extension to multi-class classification is discussed in Sect. D.
For compactness of notation, the task index i is sometimes moved from subscript to superscript; this causes no confusion because i is used consistently.
The income, education and non-market housing data is from the Vancouver Open Data Catalog (https://opendata.vancouver.ca/pages/home/). The crime data is from Kaggle (https://www.kaggle.com/datasets/wosaku/crime-in-vancouver).
References
Adams, R.P., Murray, I., MacKay, D.J. (2009). Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, pp 9–16.
Aglietti, V., Damoulas, T., Bonilla, E.V. (2019). Efficient inference in multi-task Cox process models. In: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, pp 537–546.
Alvarez, M.A., Lawrence, N.D. (2008). Sparse convolved Gaussian processes for multi-output regression. In: NIPS, pp 57–64.
Álvarez, M. A., Rosasco, L., & Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3), 195–266.
Álvarez, M.A., Ward, W., Guarnizo, C. (2019). Non-linear process convolutions for multi-output Gaussian processes. In: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, pp 1969–1977.
Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Besag, J. (1994). Discussion on the paper by Grenander and Miller. Journal of the Royal Statistical Society: Series B (Methodological), 56, 591–592.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Cham: Springer.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.
Bonilla, E.V., Chai, K.M.A., Williams, C.K.I. (2007). Multi-task Gaussian process prediction. In: Platt JC, Koller D, Singer Y, et al (eds) Advances in neural information processing systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007. Curran Associates, Inc., pp 153–160.
Cunningham, J.P., Shenoy, K.V., Sahani, M. (2008). Fast Gaussian process methods for point process intensity estimation. In: International Conference on Machine Learning, ACM, pp 192–199.
Daley, D. J., & Vere-Jones, D. (2003). An introduction to the theory of point processes, Vol. I. Probability and its Applications. New York: Springer.
Dezfouli, A., & Bonilla, E. V. (2015). Scalable inference for Gaussian process models with black-box likelihoods. Advances in Neural Information Processing Systems, 28, 1414–1422.
Diggle, P. J., Moraga, P., Rowlingson, B., et al. (2013). Spatial and spatio-temporal log-Gaussian Cox processes: Extending the geostatistical paradigm. Statistical Science, 28(4), 542–563.
Donner, C., & Opper, M. (2018). Efficient Bayesian inference of sigmoidal Gaussian Cox processes. Journal of Machine Learning Research, 19(1), 2710–2743.
Galy-Fajou, T., Wenzel, F., Donner, C., et al. (2020). Multi-class Gaussian process classification made conjugate: Efficient inference via data augmentation. In: Uncertainty in Artificial Intelligence, PMLR, pp 755–765.
Hensman, J., Matthews, A., Ghahramani, Z. (2015). Scalable variational Gaussian process classification. In: Artificial Intelligence and Statistics, PMLR, pp 351–360.
Hoffman, M. D., Blei, D. M., Wang, C., et al. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(5), 1303–1347.
Jahani, S., Zhou, S., Veeramani, D., et al. (2021). Multioutput Gaussian process modulated Poisson processes for event prediction. IEEE Transactions on Reliability.
Journel, A. G., & Huijbregts, C. J. (1976). Mining geostatistics. London: Academic Press.
Lasko, T.A. (2014). Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data. In: Uncertainty in artificial intelligence: Proceedings of the Conference on Uncertainty in Artificial Intelligence, NIH Public Access, p 469.
Li, C., Zhu, J., Chen, J. (2014). Bayesian max-margin multi-task learning with data augmentation. In: International Conference on Machine Learning, PMLR, pp 415–423.
Lian, W., Henao, R., Rao, V., et al. (2015). A multitask point process predictive model. In: International Conference on Machine Learning, PMLR, pp 2030–2038.
Lloyd, C., Gunter, T., Osborne, M., et al. (2015). Variational inference for Gaussian process modulated Poisson processes. In: International Conference on Machine Learning, pp 1814–1822.
Møller, J., Syversveen, A. R., & Waagepetersen, R. P. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3), 451–482.
Moreno-Muñoz, P., Artés, A., Álvarez, M. (2018). Heterogeneous multi-output Gaussian process prediction. Advances in Neural Information Processing Systems, 31.
Mutny, M., Krause, A. (2021). No-regret algorithms for capturing events in Poisson point processes. In: International Conference on Machine Learning, PMLR, pp 7894–7904.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto, Toronto, ON, Canada.
Nguyen, T. V., & Bonilla, E. V. (2014). Automated variational inference for Gaussian process models. Advances in Neural Information Processing Systems, 27, 1404–1412.
Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349.
Rasmussen, C.E. (2003). Gaussian processes in machine learning. In: Summer School on Machine Learning, Springer, pp 63–71.
Shirota, S., Gelfand, A.E. (2017). Space and circular time log Gaussian Cox processes with application to crime event data. The Annals of Applied Statistics, pp. 481–503.
Snell, J., Zemel, R.S. (2021). Bayesian few-shot classification with one-vs-each Pólya-Gamma augmented Gaussian processes. In: International Conference on Learning Representations, ICLR 2021. OpenReview.net.
Soleimani, H., Hensman, J., & Saria, S. (2017). Scalable joint models for reliable uncertainty-aware event prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8), 1948–1963.
Taylor, B. M., Davies, T. M., Rowlingson, B. S., et al. (2015). Bayesian inference and data augmentation schemes for spatial, spatiotemporal and multivariate log-Gaussian Cox processes in R. Journal of Statistical Software, 63(1), 1–48.
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In: Artificial Intelligence and Statistics, pp 567–574.
Uspensky, J. V. (1937). Introduction to mathematical probability. New York: McGraw-Hill Book Co., Inc.
Ver Hoef, J. M., & Barry, R. P. (1998). Constructing and fitting models for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference, 69(2), 275–294.
Wenzel, F., Galy-Fajou, T., Donner, C., et al. (2019). Efficient Gaussian process classification using Pólya-Gamma data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 5417–5424.
Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning (Vol. 2). MIT Press.
Wood, F., van de Meent, J.W., Mansinghka, V. (2014). A new approach to probabilistic programming inference. In: Artificial Intelligence and Statistics, PMLR, pp 1024–1032.
Zhou, F., Li, Z., Fan, X., et al. (2020). Efficient inference for nonparametric Hawkes processes using auxiliary latent variables. Journal of Machine Learning Research, 21(241), 1–31.
Zhou, F., Zhang, Y., Zhu, J. (2021). Efficient inference of flexible interaction in spiking-neuron networks. In: International Conference on Learning Representations.
Zhou, F., Kong, Q., Deng, Z., et al. (2022). Efficient inference for dynamic flexible interactions of neural populations. Journal of Machine Learning Research, 23(211), 1–49.
Funding
This work was conceived and conducted when FZ was doing post-doc at Tsinghua University. This work was supported by NSFC Projects (Nos. 62061136001, 62106121, 61621136008, U19A2081), the MOE Project of Key Research Institute of Humanities and Social Sciences (22JJD110001), the National Key Research and Development Program of China (2021ZD0110502), the major key project of PCL (No. PCL2021A12), the fund for building world-class universities (disciplines) of Renmin University of China (No. KYGJC2023012), and Tsinghua Guo Qiang Institute, and the High Performance Computing Center, Tsinghua University, and the Public Computing Cloud, Renmin University of China.
Author information
Authors and Affiliations
Contributions
All authors have contributed significantly to this work. The theoretical derivation was primarily performed by FZ, and the experimental validation was primarily performed by FZ, QK and ZD with support from FH and PC. All authors wrote the manuscript. JZ supervised the whole project.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Willem Waegeman.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A. Proof of augmented likelihood for classification
Substituting Eq. (2) in the paper into the classification likelihood Eq. (1b) in the paper, we can obtain
where the integrand is the augmented likelihood:
Appendix B. Proof of augmented likelihood for Cox process
Substituting Eqs. (2) and (4) in the paper into the product and exponential integral terms respectively in the Cox process likelihood Eq. (1c) in the paper, we can obtain
where the integrand is the augmented likelihood:
Appendix C. Proof of mean-field approximation
The augmented joint distribution can be written as:
Here, we assume the variational posterior factorizes as \(q(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})=q_1(\varvec{\omega }^c,\varvec{\omega }^p,\Pi )q_2(g,\bar{\varvec{\lambda }})\). To minimize the KL divergence between the variational posterior and the true posterior, it can be shown that the logarithm of the optimal distribution of each factor is, up to an additive constant, the expectation of the logarithm of the joint distribution taken over the variables in the other factor (Bishop, 2006):
Substituting Eq. (C5) into Eq. (C6), we can obtain the optimal variational distributions. The process of deriving variational posteriors for \(\varvec{\omega }^c\), \(\varvec{\omega }^p\), \(\Pi\), and \(\bar{\varvec{\lambda }}\) is similar to that in Donner and Opper (2018). The primary distinction lies in the treatment of the latent function g. Further details are provided below.
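Written out, the optimality condition in Eq. (C6) is the standard two-factor mean-field result (Bishop, 2006), with \(p\) denoting the augmented joint distribution of Eq. (C5):

```latex
\log q_1^*(\boldsymbol{\omega}^c, \boldsymbol{\omega}^p, \Pi)
  = \mathbb{E}_{q_2}\!\left[\log p\right] + \mathrm{const},
\qquad
\log q_2^*(g, \bar{\boldsymbol{\lambda}})
  = \mathbb{E}_{q_1}\!\left[\log p\right] + \mathrm{const}.
```

Iterating these two updates until convergence yields the coordinate-ascent scheme used throughout this appendix.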
1.1 The optimal density for Pólya-Gamma latent variables
The optimal variational posteriors of \(\varvec{\omega }^c\) and \(\varvec{\omega }^p\) are
where \(\tilde{g}^\cdot _{i,n}=\sqrt{\mathbb {E}[{g^\cdot _{i,n}}^2]}\) and we adopt the tilted Pólya-Gamma distribution \(p_{\text {PG}}(\omega \mid b,c)\propto e^{-c^2\omega /2}p_{\text {PG}}(\omega \mid b,0)\) (Polson et al., 2013).
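In the downstream updates, only the first moment of these tilted Pólya-Gamma posteriors is needed, and it is available in closed form: \(\mathbb {E}[\omega \mid b,c]=\frac{b}{2c}\tanh (c/2)\) (Polson et al., 2013). A minimal sketch (the function name `pg_mean` is ours, not from the paper's code):

```python
import math

def pg_mean(b: float, c: float) -> float:
    """Mean of the tilted Polya-Gamma distribution PG(b, c):
    E[omega | b, c] = b / (2c) * tanh(c / 2)  (Polson et al., 2013).
    The c -> 0 limit is b / 4, which recovers the untilted PG(b, 0) mean.
    """
    if abs(c) < 1e-8:
        return b / 4.0
    return b / (2.0 * c) * math.tanh(c / 2.0)

# Example: E[omega] for PG(1, 0) is 1/4; the tilt c = g~ shrinks it.
print(pg_mean(1.0, 0.0))  # 0.25
```

This is why the mean-field updates stay fully analytical: no Pólya-Gamma sampling is ever required, only this expectation evaluated at \(c=\tilde{g}^\cdot _{i,n}\).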
1.2 The optimal intensity for marked Poisson processes
The derivation of the optimal variational posterior of \(\Pi =\{\Pi _i\}_{i=1}^{I_p}\) is challenging, so we provide some details below. After taking the expectation, we obtain
where \(\bar{\lambda }^1_i=e^{\mathbb {E}[\log \bar{\lambda }_i]}\) and \(\tilde{\Lambda }_i(\textbf{x},\omega )=\bar{\lambda }^1_i p_{\text {PG}}(\omega \mid 1,0)\). The second line of Eq. (C8) uses Campbell’s theorem \(\mathbb {E}_{\Pi _i}\left[ \exp {\left( \sum _{(\textbf{x},\omega )\in \Pi _i}h(\textbf{x},\omega )\right) }\right] =\exp {\left[ \iint \left( e^{ h(\textbf{x},\omega )}-1\right) \tilde{\Lambda }_i(\textbf{x},\omega )d\omega d\textbf{x}\right] }\). It is easy to see that the posterior intensity of \(\Pi _i\) is
where we adopt \(e^{-c^2\omega /2}p_{\text {PG}}(\omega \mid b,0)=2\,s(-c)e^{c/2}p_{\text {PG}}(\omega \mid b,c)\) (Polson et al., 2013), \(\tilde{g}_i^p(\textbf{x})=\sqrt{\mathbb {E}[{g_i^p(\textbf{x})}^2]}\), \(\bar{g}_i^p(\textbf{x})=\mathbb {E}[g_i^p(\textbf{x})]\).
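Campbell's theorem in the step above can also be sanity-checked numerically. The sketch below is our own illustration (not the paper's code), specialized to a homogeneous rate \(\lambda\) on [0, 1] and a constant function \(h(\textbf{x},\omega )=c\), for which the theorem reduces to \(\mathbb {E}[e^{cN}]=\exp (\lambda (e^c-1))\) with \(N\sim \text {Poisson}(\lambda )\):

```python
import math
import random

def poisson_sample(rng: random.Random, lam: float) -> int:
    """Knuth's algorithm for drawing a Poisson(lam) random variable."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def campbell_lhs(lam: float, c: float, n_samples: int = 100_000,
                 seed: int = 0) -> float:
    """Monte Carlo estimate of E_Pi[exp(sum_{x in Pi} h(x))] for a
    homogeneous Poisson process Pi with rate lam on [0, 1] and constant
    h(x) = c, i.e. E[exp(c * N)] with N ~ Poisson(lam)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += math.exp(c * poisson_sample(rng, lam))
    return total / n_samples

lam, c = 2.0, 0.5
rhs = math.exp(lam * (math.exp(c) - 1.0))  # Campbell's theorem prediction
lhs = campbell_lhs(lam, c)                 # Monte Carlo estimate
```

With these settings the Monte Carlo estimate agrees with the analytical right-hand side to well within 1%.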
1.3 The optimal density for intensity upper-bounds
The optimal variational posterior of \(\bar{\varvec{\lambda }}\) is
where \(p_{\text {Ga}}\) is the Gamma density, \(R_i=\int _\mathcal {X}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega d\textbf{x}\), and \(|\mathcal {X}|\) is the size of the domain.
1.4 The optimal density for latent functions
The derivation of the optimal variational posterior of g is also challenging, so we provide some details below. After taking the expectation, we obtain
where \(A_i(\textbf{x})=\sum _{n=1}^{N^p_i}\mathbb {E}[\omega _{i,n}^p]\delta (\textbf{x}-\textbf{x}^p_{i,n})+\int _0^\infty \omega \Lambda _i^1(\textbf{x},\omega )d\omega\) and \(B_i(\textbf{x})=\frac{1}{2}\sum _{n=1}^{N^p_i}\delta (\textbf{x}-\textbf{x}^p_{i,n})-\frac{1}{2}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega\).
The computation of Eq. (C11) suffers from cubic complexity w.r.t. the number of data points in the regression, classification and point process tasks. We use the inducing-inputs formalism to make the inference scalable. We place M inducing inputs \([\textbf{x}_1,\ldots ,\textbf{x}_M]^\top\) on the domain \(\mathcal {X}\) for each task. The values of the basis function \(f_q\) at these inducing inputs are denoted \(\textbf{f}_{q,\textbf{x}_m}\). The values of the task-specific latent function \(g_i\) at the inducing inputs are then \(\textbf{g}_{\textbf{x}_m}^i=\sum _{q=1}^Qw_{i,q}\textbf{f}_{q,\textbf{x}_m}\). If we define \(\textbf{g}_{\textbf{x}_m}=[\textbf{g}^{\top }_{1,\textbf{x}_m},\ldots ,\textbf{g}^{\top }_{I,\textbf{x}_m}]^\top\), then \(\textbf{g}_{\textbf{x}_m}\sim \mathcal {N}(\textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m})\), where \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}\) is the MOGP covariance at \(\textbf{x}_m\) for all tasks, and \(\textbf{g}_{\textbf{x}_m}^i\sim \mathcal {N}(\textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m}^i)\), where \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}^i\) is the i-th diagonal block of \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}\). Given \(\textbf{g}_{\textbf{x}_m}^i\), we assume the function \(g_i(\textbf{x})\) equals the posterior mean function \(g_i(\textbf{x})=\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\textbf{g}_{\textbf{x}_m}^i\), where \(\textbf{k}_{\textbf{x}_m \textbf{x}}^{i}\) is the kernel between the inducing points and the prediction points for the i-th task. Therefore, \(\{g_{i,n}^r\}_{i=1}^{I_r}\), \(\{g_{i,n}^c\}_{i=1}^{I_c}\) and \(\{g_i^p(\textbf{x})\}_{i=1}^{I_p}\) can be written as
where \(\textbf{g}_{i}^r=[g_{i,1}^r,\ldots ,g_{i,N_i^r}^r]^\top\), \(\textbf{g}_{i}^c=[g_{i,1}^c,\ldots ,g_{i,N_i^c}^c]^\top\), \(g_i^p(\textbf{x})\) is the function value of \(g_i^p\) on \(\textbf{x}\).
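The posterior-mean interpolation \(g_i(\textbf{x})=\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\textbf{g}_{\textbf{x}_m}^i\) can be illustrated with a short sketch; the RBF kernel and all names below are our own assumptions for illustration, not the paper's code. A useful property to observe is that at an inducing input the interpolant reproduces the stored value exactly:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# M inducing inputs on the domain and the latent values stored there
# (a stand-in for g_{x_m}^i from the text).
Xm = np.linspace(0.0, 5.0, 6)
gm = np.sin(Xm)

# Prior covariance at the inducing inputs, with jitter for stability.
Kmm = rbf(Xm, Xm) + 1e-10 * np.eye(len(Xm))

def g_interp(x):
    """Posterior-mean interpolant g_i(x) = k(x, Xm)^T Kmm^{-1} g_m."""
    kxm = rbf(np.atleast_1d(float(x)), Xm)
    return (kxm @ np.linalg.solve(Kmm, gm))[0]
```

Because evaluating `g_interp` only requires the M stored values, the cubic cost is paid once on the M inducing inputs rather than on all data points.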
Substituting Eq. (C12) into Eq. (C11), we obtain the inducing points version of Eq. (C11):
It is easy to see that the third line of Eq. (C13) is a multivariate Gaussian distribution of \(\textbf{g}_{\textbf{x}_m}^{p,i}\). The likelihoods of \(\textbf{g}_{\textbf{x}_m}^{r,i}\) for regression, \(\textbf{g}_{\textbf{x}_m}^{c,i}\) for classification and \(\textbf{g}_{\textbf{x}_m}^{p,i}\) for point process tasks are all Gaussian, so they are conjugate to the MOGP prior and we obtain the closed-form variational posterior for \(\textbf{g}_{\textbf{x}_m}\):
where \(\textbf{g}_{\textbf{x}_m}=[\textbf{g}_{\textbf{x}_m}^{r\top },\textbf{g}_{\textbf{x}_m}^{c\top }, \textbf{g}_{\textbf{x}_m}^{p\top }]^\top\), \(\textbf{g}_{\textbf{x}_m}^{\cdot }=[\textbf{g}_{1,\textbf{x}_m}^{\cdot \top },\ldots , \textbf{g}_{I_\cdot ,\textbf{x}_m}^{\cdot \top }]^\top\) and
where \(\textbf{H}_{\textbf{x}_m}^\cdot =\text {diag}(\textbf{H}_{1,\textbf{x}_m}^\cdot ,\ldots , \textbf{H}_{I_\cdot ,\textbf{x}_m}^\cdot )\), \(\textbf{v}_{\textbf{x}_m}^\cdot =[\textbf{v}_{1,\textbf{x}_m}^{\cdot \top },\ldots , \textbf{v}_{I_\cdot ,\textbf{x}_m}^{\cdot \top }]^\top\) and
where \(\textbf{D}^r_i=\text {diag}(1/\sigma _i^2)\) and \(\textbf{D}^c_i=\text {diag}(\mathbb {E}[\varvec{\omega }^{c}_i])\).
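The closed-form Gaussian posterior above combines the MOGP prior with the Gaussian likelihood terms collected in \(\textbf{H}_{\textbf{x}_m}\) and \(\textbf{v}_{\textbf{x}_m}\). The following is a minimal numpy sketch of this generic conjugate update in its standard natural-parameter form; it is our own illustration under that assumed form, not the paper's exact expression:

```python
import numpy as np

def gaussian_update(K, H, v):
    """Generic conjugate Gaussian update: a zero-mean prior N(0, K)
    combined with a Gaussian likelihood contribution of the form
    exp(v^T g - g^T H g / 2) yields the posterior N(mu, Sigma) with
        Sigma = (K^{-1} + H)^{-1},   mu = Sigma @ v.
    """
    Sigma = np.linalg.inv(np.linalg.inv(K) + H)
    mu = Sigma @ v
    return mu, Sigma

# Sanity check: with no likelihood information (H = 0, v = 0),
# the update simply returns the prior N(0, K).
K = np.array([[1.0, 0.3],
              [0.3, 1.0]])
mu, Sigma = gaussian_update(K, np.zeros((2, 2)), np.zeros(2))
```

In this form, each mean-field iteration reduces to assembling the blocks of H and v from the current Pólya-Gamma and intensity expectations and applying one such linear-algebra update.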
Appendix D. Multi-class classification
In the paper, we mainly focus on the binary classification problem because each binary classification task corresponds to a single latent function. This setting is consistent with the regression and point process tasks, each of which also specifies only a single latent function.
For a Z-class classification problem, each task corresponds to Z latent functions. The usual likelihood for multi-class classification is the softmax function:
where \(f_{i,n}^{c,k}=f_{i}^{c,k}(\textbf{x}_n)\), \(\textbf{f}_{i,n}^c=[f_{i,n}^{c,1},\ldots ,f_{i,n}^{c,Z}]^{\top }\), \(k\in \{1,\ldots ,Z\}\). However, the Pólya-Gamma augmentation technique for binary classification cannot be directly applied to the softmax function. Galy-Fajou et al. (2020) and Snell and Zemel (2021) proposed the logistic-softmax function and the one-vs-each softmax approximation, respectively, which enable Pólya-Gamma augmentation and yield a conditionally conjugate model for multi-class classification tasks. Both methods can be incorporated into our framework in the multi-class classification scenario. We refer readers to Galy-Fajou et al. (2020) and Snell and Zemel (2021) for more details.
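To make the first alternative concrete, the logistic-softmax of Galy-Fajou et al. (2020) replaces each exponential in the softmax with a logistic sigmoid, so every component involves only the sigmoid, which is exactly the function Pólya-Gamma augmentation handles. A minimal sketch (function names are ours):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logistic_softmax(f: list) -> list:
    """Logistic-softmax likelihood (Galy-Fajou et al., 2020):
    p(y = k | f) = sigma(f_k) / sum_j sigma(f_j),
    with sigma the logistic sigmoid applied to each of the Z
    latent function values.
    """
    s = [sigmoid(fk) for fk in f]
    z = sum(s)
    return [sk / z for sk in s]

# Example with Z = 3 latent function values for one input.
probs = logistic_softmax([2.0, -1.0, 0.5])
```

The outputs form a valid probability vector over the Z classes, and each sigmoid factor can be augmented with its own Pólya-Gamma variable as in the binary case.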
Appendix E. Comparison with HetMOGP
One anonymous reviewer pointed out that an important baseline to compare against is Moreno-Muñoz et al. (2018), which can also handle regression, classification and counting data, even if it uses a discretized Poisson distribution likelihood instead of the continuous point process likelihood considered in this work. Moreno-Muñoz et al. (2018) used the generic variational inference method mentioned in the introduction for the parameter posterior, so this comparison can demonstrate the advantage of using data augmentation for conjugate operations.
We compare the performance in terms of TLL and RT for HMGCP and the heterogeneous multi-output Gaussian process (HetMOGP) (Moreno-Muñoz et al., 2018) on the synthetic data from Sects. 5.1 and 5.2. Since HetMOGP can only handle discrete count data, we discretize the original observation window [0, 100] into 100 bins and count the number of points in each bin. We use the default hyperparameter settings in the demo code provided by Moreno-Muñoz et al. (2018). The results are shown in Tables 4 and 5: HMGCP attains a better TLL than HetMOGP trained on the discretized count data. For a fair comparison of efficiency, we run both HMGCP and HetMOGP on all tasks, and our inference is much faster. This is because HetMOGP relies on generic variational inference, so numerical optimization has to be performed within the variational iterations, whereas the variational iterations of HMGCP have completely analytical expressions due to data augmentation, leading to more efficient computation. It is worth noting that the running times presented in Tables 4 and 5 encompass all tasks (regression, classification, and Cox processes) and are therefore longer than those reported in Sects. 5.1 and 5.2, which are based solely on the Cox process tasks.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, F., Kong, Q., Deng, Z. et al. Heterogeneous multi-task Gaussian Cox processes. Mach Learn 112, 5105–5134 (2023). https://doi.org/10.1007/s10994-023-06382-1