1 Introduction

The relationship between multiple response variables and a set of predictors is a long-standing topic of research. A prominent tool in this field is reduced rank regression, which links the response variables to the predictors linearly under a low-rank constraint; see e.g. Izenman (2008), Cook (2018), Reinsel et al. (2023) and Giraud (2021). This approach has been widely studied and applied, with works including Anderson (1951), Izenman (1975), Bunea et al. (2011), Geweke (1996), and many others. Both frequentist and Bayesian methods have been employed to analyze and model these relationships; see Corander and Villani (2004), Chakraborty et al. (2020), Alquier (2013), Goh et al. (2017), Yang et al. (2020), Kleibergen and Paap (2002), Chen et al. (2013), She and Chen (2017).

However, despite the extensive research in this area, most studies have focused on real-valued responses. In many applications, the entries of the response matrix are binary, that is, they are in the set \( \{ -1,1\} \). For example, the treatment responses from multiple drugs can be recoded as binary or categorical when measured from each cell line, as seen in studies by Hayes et al. (2006), Greenlund et al. (2005), Mishra and Müller (2022). This highlights the need for further research on the use of binary or categorical response variables in reduced rank regression and other multivariate modeling techniques.

The problem of modeling multiple binary response variables has received limited attention in recent years, with only a few studies proposing reduced-rank regression models as a solution. One notable example is the paper by Luo et al. (2018), which proposed a mixed-outcome reduced-rank regression model to handle response matrices that include both binary and count data, and also addressed the issue of missing data in the responses. Another recent study, carried out independently of Luo et al. (2018), is the paper by Park et al. (2022), which considered row-wise sparsity constraints in addition to the low-rank assumption, but only for fully observed binary response matrices. The main idea behind these studies is to assume a marginal logistic regression model to relate the binary response and the covariates, and then employ a penalized maximum likelihood method.

However, the studies in Luo et al. (2018) and Park et al. (2022) primarily focus on recovering the parameter matrix of interest, and provide estimation error rates for their estimators. While their results demonstrate that their estimators are consistent when estimating a low-rank matrix, they do not provide any guarantee on the prediction error or misclassification error. This highlights the need for further research in this area, particularly in terms of developing methods that not only accurately estimate the parameter matrix, but also provide guarantees on the performance of predictions and classification.

One of the key challenges in addressing the problem of multiple binary responses is the lack of robust and generalizable models that can accurately predict and classify the outcomes. In this paper, we aim to address this gap by taking a machine learning approach and adopting a classification-based method to deal with binary output. Unlike traditional methods that rely on parametric models, we consider a set of prediction matrices and seek the one that yields the best prediction error. This approach is built on the principles of statistical learning theory (Vapnik 1998), where the zero–one loss is used as a measure of prediction error, and the risk of the classifier is controlled by a PAC (probably approximately correct) bound. However, the zero–one loss is non-convex, which makes its minimization computationally intractable. An alternative is to use a convex surrogate such as the hinge loss (Zhang 2004), which has been shown to be effective in a variety of machine learning tasks and is computationally efficient. By leveraging this method, we aim to provide a robust and generalizable model for predicting and classifying multiple binary responses.

In this work, we propose a novel approach to addressing the problem of multiple binary responses that combines elements of both Bayesian and machine learning methodologies. Specifically, we propose a pseudo-Bayesian approach that utilizes a notion of risk based on the hinge loss, rather than relying on a likelihood function. Our approach is based on the principles of PAC-Bayesian theory (Shawe-Taylor and Williamson 1997; McAllester 1998; Herbrich and Graepel 2002; Catoni 2007; Dalalyan and Tsybakov 2008; Seldin et al. 2012; Alquier et al. 2016; Seldin and Tishby 2010; Germain et al. 2015), which provides theoretical guarantees on the prediction and misclassification error for our method. It is worth mentioning that replacing the likelihood with a loss function has become popular in recent years under the name of generalized Bayesian inference, as documented for example in Matsubara et al. (2022), Jewson and Rossell (2022), Yonekura and Sugasawa (2023), Medina et al. (2022), Grünwald and Van Ommen (2017), Bissiri et al. (2016), Lyddon et al. (2019), Syring and Martin (2019).

The use of a hinge loss-based risk function allows us to overcome some of the limitations of traditional likelihood-based Bayesian models, particularly when dealing with binary response variables. Unlike traditional likelihood functions, which can be difficult to model and computationally intensive to compute, the hinge loss function is convex and can be easily optimized. To make the approach practical, we also develop an efficient gradient-based sampling method based on Langevin Monte Carlo. This method allows us to sample approximately from the pseudo-posterior, making the approach computationally tractable and suitable for large-scale applications. Overall, our proposed approach to modeling and predicting multiple binary responses provides both theoretical guarantees and practical computational methods for its implementation.

Similar to the approach proposed in Luo et al. (2018), our proposed method has the capability to handle incomplete response matrices. This is achieved by extending our approach to account for missing data in the response matrix, which allows for greater flexibility and applicability of our method. The extension of our method to handle missing data is relatively simple and does not introduce any additional complexity to the overall approach. This allows for our method to be applied to a wider range of datasets with incomplete or missing data. This capability is especially useful in real-world applications, where missing data is often present and traditional methods may struggle to handle such cases effectively. For example, in studies involving medical treatments, it is not uncommon for certain patients to drop out of the study, resulting in missing data in the response matrix. Our proposed method allows for the inclusion of such missing data, providing a more comprehensive and realistic analysis of the treatment outcomes. Additionally, in observational studies, missing data can be a common problem due to various factors such as non-response, measurement error or data collection issues. Our proposed method’s ability to handle missing data would be beneficial in these scenarios as well.

The remainder of the paper is structured as follows. Section 2 provides a detailed description of the problem statement and presents our proposed method, along with its theoretical results. An extension to handle incomplete response data is also discussed in this section. In Sect. 3, we describe the Langevin Monte Carlo method used to compute our proposed method and present numerical studies on both synthetic and real datasets to demonstrate its performance. Conclusions and discussions are given in Sect. 4. All technical proofs are gathered in Appendix A.

2 Problem and method

2.1 Problem statement

We formally consider the following problem of multiple binary responses with a common set of covariates: for units \( i = 1,\ldots , n \) with covariate vectors \( x_i \in {\mathbb {R}}^p \), there are q binary responses \( y^k_{i} \in \{-1,1\} \), \( k = 1, \ldots , q \). From a classification perspective, it is natural to use a linear predictor as a function from \( {\mathbb {R}}^p \) to \( \{-1,1\} \) in the following way: when \( x_i \) is revealed, \(M^{(k)} \in {\mathbb {R}}^p \) predicts \( y_{i}^k \) by \(\textrm{sign}(x_i^\top M^{(k)})\).

In matrix notation, define the binary response matrix \( Y = [ y^1, \ldots , y^q ] \in \{-1,1\}^{n\times q} \) and the covariate matrix \( X = [x_1^\top , \ldots , x_n^\top ]^\top \in {\mathbb {R}}^{n\times p} \). With multiple responses, the predictors can be written in matrix form as \( M = [ M^{(1)}, \ldots , M^{(q)} ] \in {\mathbb {R}}^{p\times q} \).

The ability of the predictor to predict a new entry of the matrix is then assessed by the risk

$$\begin{aligned} R(M) = {\mathbb {E}} \left[ \mathbbm {1}_{( Y_{11} \cdot (X M)_{11} < 0)}\right] , \end{aligned}$$

and its empirical counterpart is:

$$\begin{aligned} r(M) = \frac{1}{nq}\sum _{i=1}^n \sum _{j=1}^q \mathbbm {1}_{(Y_{ij} (XM)_{ij}<0)} . \end{aligned}$$

From the standard approach in classification theory (Vapnik 1998; Devroye et al. 1996), the best possible classifier is the Bayes classifier, \( M^B \), such that

$$\begin{aligned} R(M^B)= \inf _M R(M). \end{aligned}$$

The anticipated property of the Bayes matrix, \( M^B \), is that it exhibits a low-rank structure or can be effectively approximated by a low-rank matrix.

For the sake of simplicity, we put \( {\overline{R}} = R(M^B) \) and \({\overline{r}}=r(M^B)\). The goal is to find an estimator \( {\hat{M}} \) that yields the minimal excess risk \( R({\hat{M}}) -{\overline{R}} \).

While the risk R(M) has a clear interpretation, working with its empirical counterpart r(M) is challenging as it is non-smooth and non-convex. A common approach to overcome this issue is to replace the empirical risk with a convex surrogate (Zhang 2004). In this paper, we primarily focus on the hinge loss, which results in the following hinge empirical risk:

$$\begin{aligned} r^h(M) = \frac{1}{nq}\sum _{i=1}^n \sum _{j=1}^q ( 1 - Y_{ij} (XM)_{ij})_+ \, , \end{aligned}$$

where \( (a)_+:= \max (a,0),\forall a \in {\mathbb {R}} \).
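As an illustration, the two empirical risks can be computed in a few lines of Python. This is a minimal sketch with NumPy; the function names and the toy data are assumptions introduced here for illustration and are not part of the method's implementation.

```python
import numpy as np

def zero_one_risk(M, X, Y):
    """Empirical risk r(M): fraction of entries with Y_ij * (XM)_ij < 0."""
    margins = Y * (X @ M)
    return float(np.mean(margins < 0))

def hinge_risk(M, X, Y):
    """Hinge empirical risk r^h(M): average of (1 - Y_ij * (XM)_ij)_+."""
    margins = Y * (X @ M)
    return float(np.mean(np.maximum(1.0 - margins, 0.0)))

# Toy data (illustrative only): noiseless labels generated by a rank-2 matrix.
rng = np.random.default_rng(0)
n, p, q = 50, 5, 4
X = rng.standard_normal((n, p))
M_true = rng.standard_normal((p, 2)) @ rng.standard_normal((2, q))
Y = np.where(X @ M_true >= 0, 1.0, -1.0)
M_other = rng.standard_normal((p, q))

for name, M in [("M_true", M_true), ("M_other", M_other)]:
    print(name, zero_one_risk(M, X, Y), hinge_risk(M, X, Y))
```

Since \( (1-u)_+ \ge \mathbbm {1}_{(u<0)} \) for every real u, the hinge empirical risk always upper-bounds the zero–one empirical risk, which is what makes it a usable surrogate.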

2.2 Estimation procedure

Building upon the work previously done in the field of PAC-Bayesian theory, we define the pseudo-posterior distribution as follows:

$$\begin{aligned} {\widehat{\rho }}_\lambda (M) \propto \exp [-\lambda r^h(M)] \pi (M) \end{aligned}$$

where \(\lambda >0\) is a tuning parameter that will be discussed later and \(\pi (M)\) is a prior distribution, given in (1), that promotes (approximately) low-rankness on the parameter matrix M.

The term \( {\widehat{\rho }}_\lambda \), known as the Gibbs posterior, can be interpreted as the posterior distribution under a Bayesian framework, where \(\pi \) represents the prior distribution for the parameter M. However, this Bayesian interpretation is not essential for understanding the approach, as it relies on the proportionality of \(\exp [-\lambda r^h(M)]\) to a likelihood function. The motivation behind defining \( {\widehat{\rho }}_\lambda \) stems from the minimization problem in Lemma 1 rather than Bayesian principles. It is not necessary to have a likelihood function or a complete model; only the empirical risk based on the hinge loss function is required.

Nonetheless, we still refer to \(\pi \) as the prior and \( {\widehat{\rho }}_\lambda \) as the pseudo-posterior. The measure \( {\widehat{\rho }}_\lambda \) can be seen as an adjusted version of \(\pi \). Comparing two parameters, \(m_1\) and \(m_2\), if \( r^h(m_1) < r^h(m_2) \), then \(\exp [-\lambda r^h(m_1)] > \exp [-\lambda r^h(m_2)] \) for any \( \lambda >0 \). This implies that, relative to \(\pi \), \( {\widehat{\rho }}_\lambda \) assigns more weight to \( m_1 \) than to \( m_2 \). The adjustment in the distribution thus favors the parameter value that results in a smaller in-sample hinge empirical risk. The tuning parameter \( \lambda \) controls the degree of adjustment. The choice of \( \lambda \) will be further explored in subsequent sections. This pseudo-Bayesian approach has been previously studied in various low-rank matrix estimation problems, such as Alquier (2013), Mai and Alquier (2017), Mai and Alquier (2015), Cottet and Alquier (2018), Mai (2023).

In this work, we have opted to use a spectral scaled Student prior distribution, as follows, with a parameter \(\tau >0\),

$$\begin{aligned} \pi (M) \propto \det (\tau ^2 {{\textbf{I}}}_{p} + MM^\intercal )^{-(p+q+2)/2}. \end{aligned}$$
(1)

This prior can induce low-rankness of the matrix M, as it can be verified that \( \pi (M) \propto \prod _{j=1}^{p} (\tau ^2 + s_j(M)^2 )^{- (p+q+2)/2 } \), where \( s_j(M) \) denotes the \(j^{th}\) largest singular value of M (with \( s_j(M) = 0 \) for \( j > \min (p,q) \)). In other words, the prior is a product of scaled Student densities evaluated at the \( s_j(M) \), which induces approximate sparsity on the \(s_j(M)\) (Dalalyan and Tsybakov 2012). Thus, under this prior distribution, most of the \(s_j(M)\) are close to 0, so that M is approximately low-rank. This prior has been used before in image denoising (Dalalyan 2020) and in bilinear regression and matrix completion (Mai 2023; Yang et al. 2018). Although this prior is not conjugate in our problem, its log-density is differentiable, which makes it convenient for Langevin Monte Carlo, a gradient-based sampling method.
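The determinant and singular-value forms of the prior (1) can be checked numerically. The Python sketch below evaluates the unnormalized log-prior both ways, together with its gradient, which follows from \( \nabla _M \log \det (\tau ^2 {{\textbf{I}}}_{p} + MM^\intercal ) = 2(\tau ^2 {{\textbf{I}}}_{p} + MM^\intercal )^{-1}M \). The function names and test values are assumptions made for illustration.

```python
import numpy as np

def log_prior_det(M, tau):
    """Unnormalized log-prior via the determinant form in (1)."""
    p, q = M.shape
    _, logdet = np.linalg.slogdet(tau**2 * np.eye(p) + M @ M.T)
    return -0.5 * (p + q + 2) * logdet

def log_prior_svd(M, tau):
    """Same quantity via the singular values, padded with zeros up to length p."""
    p, q = M.shape
    s = np.zeros(p)
    sv = np.linalg.svd(M, compute_uv=False)  # length min(p, q)
    s[: sv.size] = sv
    return -0.5 * (p + q + 2) * np.sum(np.log(tau**2 + s**2))

def grad_log_prior(M, tau):
    """Gradient of the log-prior: -(p + q + 2) (tau^2 I_p + M M^T)^{-1} M."""
    p, q = M.shape
    A = tau**2 * np.eye(p) + M @ M.T
    return -(p + q + 2) * np.linalg.solve(A, M)

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 3))
tau = 0.5
print(log_prior_det(M, tau), log_prior_svd(M, tau))
```

Using `slogdet` and a linear solve avoids forming the determinant or the inverse explicitly, which is a common numerical precaution rather than a requirement of the method.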

2.3 Theoretical results

In this work, we make use of the margin assumption of Mammen and Tsybakov (1999).

Assumption 1

(Margin assumption) We assume that there is a constant \(C \ge 1 \) such that:

$$\begin{aligned} {\mathbb {E}}\left[ \left( \mathbbm {1}_{Y (XM)\le 0} - \mathbbm {1}_{Y (XM^B)\le 0} \right) ^2\right] \le C[R(M)-{\overline{R}}]. \end{aligned}$$

As an example, in the noiseless case where \(Y = \textrm{sign} (XM^B) \) almost surely, we have that

$$\begin{aligned} {\mathbb {E}}\left[ \left( \mathbbm {1}_{Y (XM)\le 0} - \mathbbm {1}_{Y (XM^B)\le 0} \right) ^2\right]&= {\mathbb {E}}\left[ \mathbbm {1}_{Y (XM) \le 0}^2\right] = {\mathbb {E}}\left[ \mathbbm {1}_{Y (XM) \le 0}\right] \\&= R(M) = R(M)-{\overline{R}}. \end{aligned}$$

Thus, the margin assumption is satisfied with \(C=1 \).

We now present a theoretical bound on the expected risk for a random estimator of M generated from the pseudo-posterior \( {\widehat{\rho }}_\lambda (M) \).

Theorem 1

Assume that Assumption 1 is satisfied, and put \( r^* = \textrm{rank} (M^B ) \) and \( \tau ^2 = (p+q)/(2q^2pn \Vert X\Vert _{_F}^2 ) \). Then, for any \(\epsilon \in (0,1) \) and \( \varsigma \in (0,1) \), and for \(\lambda = 2 nq/(3C + 2) \), with probability at least \(1-2\epsilon \),

$$\begin{aligned}&\int R d{\widehat{\rho }}_\lambda \\&\quad \le 2.5{\overline{R}} + \Xi _{C,\varsigma }\\&\qquad \times \frac{ r^* (q+p+2) \log \left( 1+ \frac{q\Vert X\Vert _{_F} \Vert M^B \Vert _{_F} \sqrt{np} }{ \sqrt{(p+q) r^* }} \right) + \log (1/\epsilon ) }{nq}, \end{aligned}$$

where \( \Xi _{C,\varsigma }\) is a known constant that depends only on \(\varsigma ,C \).

The proof of the above theorem is given in Appendix A. The technical argument used in the proof is known as “PAC-Bayesian bounds”, introduced in Shawe-Taylor and Williamson (1997) and McAllester (1998) as a way to provide empirical bounds on the prediction risk of Bayesian-type estimators. The PAC-Bayesian approach also comes with a set of powerful technical tools to establish non-asymptotic bounds, as documented in Catoni (2003, 2004, 2007); these tools are used in this paper. For an in-depth exploration of PAC-Bayes bounds, including recent surveys and advances, readers are referred to Guedj (2019) and Alquier (2021).

Remark 1

It is important to mention that the result of the above theorem has an adaptive characteristic in the sense that the estimator does not depend on the rank \( r^* = \textrm{rank}(M^B) \). When the rank \( r^* \) is small, the prediction error will be similar to the Bayes error, \( {\overline{R}} \), even with a small sample size. This type of result is commonly referred to as an ‘oracle inequality’ as it suggests that our estimator performs as well as if we had access to the rank of \( M^B \) through an oracle. Additionally, it is noteworthy that \( r^* \ne 0 \) is not a necessary condition in the above formula. If \(r^* =0 \), we interpret \(0\log (1+0/0) \) as 0.
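To make the quantities in Theorem 1 concrete, the following Python sketch computes the prescribed \( \lambda \) and \( \tau ^2 \) and the rate term that multiplies \( \Xi _{C,\varsigma } \), including the convention for \( r^* = 0 \). The constant \( \Xi _{C,\varsigma } \) itself is not computed, the function names are hypothetical, and the numerical rank returned by `numpy.linalg.matrix_rank` is used as a stand-in for \( r^* \).

```python
import numpy as np

def theorem1_tuning(X, q, C=1.0):
    """Return (lambda, tau^2) as prescribed in Theorem 1 for margin constant C."""
    n, p = X.shape
    lam = 2.0 * n * q / (3.0 * C + 2.0)
    tau2 = (p + q) / (2.0 * q**2 * p * n * np.linalg.norm(X, "fro") ** 2)
    return lam, tau2

def theorem1_rate(X, M_B, eps):
    """Rate term of Theorem 1, without the unknown constant Xi."""
    n, p = X.shape
    q = M_B.shape[1]
    r_star = int(np.linalg.matrix_rank(M_B))  # numerical rank, an approximation
    if r_star == 0:
        complexity = 0.0  # convention: 0 * log(1 + 0/0) is read as 0
    else:
        ratio = (q * np.linalg.norm(X, "fro") * np.linalg.norm(M_B, "fro")
                 * np.sqrt(n * p) / np.sqrt((p + q) * r_star))
        complexity = r_star * (q + p + 2) * np.log1p(ratio)
    return (complexity + np.log(1.0 / eps)) / (n * q)

# Illustrative dimensions only.
rng = np.random.default_rng(0)
n, p, q = 100, 12, 8
X = rng.standard_normal((n, p))
M_B = rng.standard_normal((p, 2)) @ rng.standard_normal((2, q))
print(theorem1_tuning(X, q, C=1.0))
print(theorem1_rate(X, M_B, eps=0.05))
```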

Corollary 2

In the case that \(Y=\textrm{sign}(XM^B)\) a.s., for any \( \epsilon \in (0,1) \) and for \(\lambda = 2 nq/5 \), with probability at least \(1-2\epsilon \),

$$\begin{aligned}&\int R d{\widehat{\rho }}_\lambda \nonumber \\&\quad \le \Xi _{1,\varsigma }' \frac{ r^* (q+p+2) \log \left( 1+ \frac{q\Vert X\Vert _{_F} \Vert M^B \Vert _{_F} \sqrt{np} }{ \sqrt{(p+q) r^* }} \right) + \log (1/\epsilon ) }{nq} \end{aligned}$$
(2)

where \( \Xi _{1,\varsigma }' = \Xi _{1,\varsigma } \).

Remark 2

Theorem 1 and Corollary 2 offer a new perspective on the prediction error, complementing the theoretical results on estimation error established in Luo et al. (2018) and Park et al. (2022). These inequalities compare the out-of-sample error of our predictor with the optimal one, and show that the prediction error rate is of order \( r^* \max (q,p)/(nq) \), up to logarithmic terms. They thus make explicit how the rank of \(M^B\) drives the prediction error.

It is worth mentioning that Assumption 1 plays a crucial role in achieving a ‘fast’ prediction rate, as demonstrated in Theorem 1. This assumption was first introduced in Mammen and Tsybakov (1999) for classification purposes, and it has since been adopted for ranking tasks in subsequent works such as Clémençon et al. (2008), Robbiano (2013), Ridgway et al. (2014). In the following proposition, we present a slower rate result that does not rely on Assumption 1. The proof is also given in Appendix A.

Proposition 1

Put \( r^* = \textrm{rank} (M^B ) \) and \( \tau ^2 = (p+q)/(2q^2pn \Vert X\Vert _{_F}^2 ) \). Then, for any \(\epsilon \in (0,1) \) and \( \varsigma \in (0,1) \), and for \(\lambda = 2\sqrt{nq/(p+q+2)} \), with probability at least \(1-2\epsilon \),

$$\begin{aligned}&\int R d{\widehat{\rho }}_\lambda \\&\quad \le 2{\overline{R}} + \Psi _{\varsigma }\\&\qquad \times \frac{ r^* \sqrt{(q+p+2)} \log \left( 1+ \frac{q\Vert X\Vert _{_F} \Vert M^B \Vert _{_F} \sqrt{np} }{ \sqrt{(p+q) r^* }} \right) + \log (1/\epsilon ) }{\sqrt{nq}}, \end{aligned}$$

where \( \Psi _{\varsigma }\) is a known constant depending only on \(\varsigma \).

2.4 Dealing with incomplete responses

As previously mentioned in the introduction, the method proposed in Luo et al. (2018) has the capability to handle incomplete response data. Similarly, our proposed approach can also be easily and naturally extended to address the scenario where the response matrix Y contains missing data. The ability to handle missing data is an important consideration, as it is a common issue in many real-world datasets. This can be especially useful in cases where data collection is difficult or expensive, as it allows for the use of all available data, rather than discarding observations with missing data.

Let \( \Omega = \{ (i,k) \in \{ 1, \ldots , n\} \times \{1,\ldots , q \}: y_{ik} \text { is observed} \} \) be the index set of the observed entries of the binary response matrix Y. Here, we have that \( |\Omega | =m < nq \). We assume that we observe a design matrix X and m i.i.d. random pairs \( ({\mathcal {O}}_1, Y_1), \ldots , ({\mathcal {O}}_m, Y_m) \). The variables \( {\mathcal {O}}_i \) are i.i.d. copies of a random variable \( {\mathcal {O}} \) having a distribution on the set \( \lbrace 1, \ldots , n \rbrace \times \lbrace 1, \ldots , q \rbrace \).

The risk in this case is given as

$$\begin{aligned} R(M) = {\mathbb {E}} \left[ \mathbbm {1}_{( Y_{1} \cdot (X M)_{{\mathcal {O}}_1} < 0)}\right] , \end{aligned}$$

and its empirical counterpart is:

$$\begin{aligned} r_m(M) = \frac{1}{m}\sum _{i=1}^m \mathbbm {1}_{(Y_{i} (XM)_{{\mathcal {O}}_i }<0)} . \end{aligned}$$

The hinge empirical risk now becomes

$$\begin{aligned} r^h_m(M) = \frac{1}{m}\sum _{i=1}^m (1 - Y_{i} (XM)_{{\mathcal {O}}_i })_{_+} \, . \end{aligned}$$
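In practice, the observed entries can be encoded by a boolean mask, and the two empirical risks are then averages over the masked entries. The Python sketch below assumes that each entry is observed at most once (so the mask plays the role of \( \Omega \)) and that missing responses are stored as NaN; both are conventions chosen here for illustration.

```python
import numpy as np

def masked_risks(M, X, Y, observed):
    """Return (r_m(M), r_m^h(M)) computed over the observed entries only.

    observed is a boolean n x q mask; entries of Y outside the mask are ignored.
    """
    margins = (Y * (X @ M))[observed]
    zero_one = float(np.mean(margins < 0))
    hinge = float(np.mean(np.maximum(1.0 - margins, 0.0)))
    return zero_one, hinge

# Toy data (illustrative only).
rng = np.random.default_rng(2)
n, p, q = 30, 4, 5
X = rng.standard_normal((n, p))
M = rng.standard_normal((p, q))
Y = np.where(X @ M + rng.standard_normal((n, q)) >= 0, 1.0, -1.0)
observed = rng.uniform(size=(n, q)) > 0.3   # roughly 30% of entries missing
Y_missing = np.where(observed, Y, np.nan)   # missing responses stored as NaN
print(masked_risks(M, X, Y_missing, observed))
```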

The subsequent theorem establishes a theoretical bound that links the integrated risk of the estimator to the minimum attainable risk achieved by the Bayes classifier, \( M^B \).

Theorem 3

Assume that Assumption 1 is satisfied and put \( r^* = \textrm{rank} (M^B ) \). Then, for any \(\epsilon \in (0,1)\) and \( \upsilon \in (0,1) \), and for \(\lambda = 2\,m/(3C + 2) \) and \( \tau ^2 = (p+q)/(2qpm \Vert X\Vert _{F}^2 ) \), with probability at least \(1-2\epsilon \),

$$\begin{aligned}&\int R d{\widehat{\rho }}_\lambda \\&\quad \le 2.5{\overline{R}} + \Xi '_{C,\upsilon }\\&\qquad \times \frac{ r^* (q+p+2) \log \left( 1+ \frac{q\Vert X\Vert _{_F} \Vert M^B \Vert _{_F} \sqrt{mp} }{ \sqrt{(p+q) r^* }} \right) + \log (1/\epsilon ) }{m} \end{aligned}$$

where \( \Xi '_{C,\upsilon }\) is a known constant that depends only on \(\upsilon ,C\).

Remark 3

The message of Theorem 3 lies in its ability to give a finite sample bound and to demonstrate that our estimate remains effective in a non-asymptotic scenario when the intrinsic dimension (i.e., the rank) is relatively small in relation to m. To clarify this point, consider the scenario where p is a function of m that increases as m increases. In this scenario, a traditional asymptotic approach would not provide useful information, but our bounds still provide valuable insights, as long as the rank of the parameter matrix is sufficiently small in relation to m.

The selection of \( \lambda \) in our results is based on optimizing an upper bound on the risk R (as presented in the proofs of the theorems, given in Appendix A). However, it is important to keep in mind that this choice may not always be the most suitable option in practice, even though it provides a reliable estimate of the magnitude of \( \lambda \). To ensure optimal performance, it is recommended to use cross-validation to adjust the temperature parameter correctly.

3 Numerical studies

3.1 Implementation and compared methods

3.1.1 Implementation

In this section, we introduce the use of the Langevin Monte Carlo (LMC) algorithm as a method for sampling from the (pseudo) posterior. The LMC algorithm is a gradient-based method for sampling from a distribution.

First, a constant step-size unadjusted LMC algorithm, as described in Durmus and Moulines (2019), is proposed. The algorithm starts with an initial matrix \(M_0\) and uses the recursion:

$$\begin{aligned} M_{k+1} = M_{k} + h\nabla \log {\widehat{\rho }}_{\lambda }(M_k) +\sqrt{2h}\,N_k, \qquad k=0,1,\ldots \end{aligned}$$
(3)

where \(h>0\) is the step-size and \( N_0, N_1,\ldots \) are independent random matrices with i.i.d. standard Gaussian entries. The drift term \( h\nabla \log {\widehat{\rho }}_{\lambda }(M_k) \) moves the chain toward regions of higher pseudo-posterior density. When the step-size h is not small enough, the iterates can diverge (Roberts and Stramer 2002), so a Metropolis-Hastings (MH) correction can be included in the algorithm. This guarantees convergence to the desired distribution, but slows down the algorithm because of the additional acceptance/rejection step at each iteration.

The update rule in (3) is now considered as a proposal for a new candidate,

$$\begin{aligned} {{\tilde{M}}}_{k+1} = M_{k} + h\nabla \log {\widehat{\rho }}_{\lambda } (M_k) +\sqrt{2h}\,N_k, \qquad k=0,1,\ldots \end{aligned}$$
(4)

This proposal is accepted or rejected according to the MH algorithm with probability:

$$\begin{aligned} \min \left\{ 1, \frac{ {\widehat{\rho }}_{\lambda } ({{\tilde{M}}}_{k+1}) q(M_k | {{\tilde{M}}}_{k+1}) }{ {\widehat{\rho }}_{\lambda } (M_k ) q({{\tilde{M}}}_{k+1} | M_k ) } \right\} , \end{aligned}$$

where \( q(x' | x) \propto \exp \left( - \Vert x'-x - h\nabla \log {\widehat{\rho }}_{\lambda } (x) \Vert ^2_F / (4\,h) \right) \) is the transition probability density from x to \(x'\). This is known as the Metropolis-adjusted Langevin algorithm (MALA). It guarantees the convergence to the (pseudo) posterior and provides a way to choose the step-size h. Unlike random-walk MH, MALA usually proposes moves into regions of higher probability, which are more likely to be accepted. The step-size h for MALA is chosen such that the acceptance rate is approximately 0.5, following Roberts and Rosenthal (1998), while the step-size for LMC in the same setting is chosen smaller than for MALA.
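The two samplers can be sketched in Python as follows. To keep the signs unambiguous, the sketch is written through the potential \( U(M) = \lambda r^h(M) - \log \pi (M) \), the negative log pseudo-posterior up to a constant, so that a step \( -h\nabla U \) moves toward higher pseudo-posterior density. The function names, the zero initialization, the subgradient chosen for the hinge loss, and the values of \( \lambda \), \( \tau \) and h in the example are all assumptions made for illustration, not the settings used in the experiments.

```python
import numpy as np

def potential(M, X, Y, lam, tau):
    """U(M) = lam * r^h(M) - log pi(M), up to an additive constant."""
    p, q = M.shape
    hinge = np.mean(np.maximum(1.0 - Y * (X @ M), 0.0))
    _, logdet = np.linalg.slogdet(tau**2 * np.eye(p) + M @ M.T)
    return lam * hinge + 0.5 * (p + q + 2) * logdet

def grad_potential(M, X, Y, lam, tau):
    """A (sub)gradient of U; the hinge loss is not differentiable at margin 1."""
    p, q = M.shape
    active = Y * (X @ M) < 1.0                 # entries with positive hinge loss
    grad_hinge = -(X.T @ (Y * active)) / Y.size
    grad_prior = (p + q + 2) * np.linalg.solve(tau**2 * np.eye(p) + M @ M.T, M)
    return lam * grad_hinge + grad_prior

def langevin(X, Y, lam, tau, h, n_iter, burn_in=0, adjust=False, seed=0):
    """Unadjusted LMC (adjust=False) or MALA (adjust=True).

    Returns the average of the post-burn-in iterates and the acceptance rate.
    """
    rng = np.random.default_rng(seed)
    M = np.zeros((X.shape[1], Y.shape[1]))
    U, g = potential(M, X, Y, lam, tau), grad_potential(M, X, Y, lam, tau)
    total, accepted = np.zeros_like(M), 0
    for k in range(n_iter):
        prop = M - h * g + np.sqrt(2.0 * h) * rng.standard_normal(M.shape)
        U_prop = potential(prop, X, Y, lam, tau)
        g_prop = grad_potential(prop, X, Y, lam, tau)
        if adjust:
            # Log proposal densities q(prop | M) and q(M | prop), up to a shared constant.
            log_fwd = -np.sum((prop - M + h * g) ** 2) / (4.0 * h)
            log_bwd = -np.sum((M - prop + h * g_prop) ** 2) / (4.0 * h)
            accept = np.log(rng.uniform()) < (U - U_prop) + log_bwd - log_fwd
        else:
            accept = True
        if accept:
            M, U, g = prop, U_prop, g_prop
            accepted += 1
        if k >= burn_in:
            total += M
    return total / (n_iter - burn_in), accepted / n_iter

# Toy example (illustrative values only).
rng = np.random.default_rng(3)
n, p, q = 40, 3, 4
X = rng.standard_normal((n, p))
M_star = rng.standard_normal((p, 2)) @ rng.standard_normal((2, q))
Y = np.where(X @ M_star >= 0, 1.0, -1.0)
M_lmc, _ = langevin(X, Y, lam=n * q, tau=1.0, h=1e-3, n_iter=2000, burn_in=500)
M_mala, acc = langevin(X, Y, lam=n * q, tau=1.0, h=1e-3, n_iter=2000, burn_in=500,
                       adjust=True)
print("MALA acceptance rate:", acc)
print("training error (LMC, MALA):",
      np.mean(Y * (X @ M_lmc) < 0), np.mean(Y * (X @ M_mala) < 0))
```

In this sketch the estimator is the average of the retained iterates, which approximates the pseudo-posterior mean; other summaries of the chain could be used instead.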

3.1.2 Compared methods

We will assess the effectiveness of our proposed methods by comparing them to Bayesian approaches that rely on logistic regression. More specifically, it is now assumed that \( Y|X = 1 \) with probability f(XM) and \( Y|X = -1 \) with probability \( 1-f(XM) \), where \( f(\cdot ) \) is the link function, \( f(x) = e^x/(1+e^x) \). In this case the pseudo-likelihood \( \exp (-\lambda r^\ell (M)) \), with \( \lambda = nq \), is exactly equal to the likelihood of the logistic model. Here,

$$\begin{aligned} r^\ell (M) = \frac{1}{nq}\sum _{i=1}^n \sum _{j=1}^q \textrm{logit}\big ( Y_{ij} (XM)_{ij}\big ) , \quad \text {and}\quad \textrm{logit}(u) = \log (1+e^{-u} ) \end{aligned}$$

is the logistic loss. The prior distribution is exactly the same as in previous sections.
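The identity between the pseudo-likelihood with \( \lambda = nq \) and the logistic likelihood can be verified numerically. The Python sketch below compares \( nq\, r^\ell (M) \) with the negative log-likelihood of the logistic model on toy data; the function names and the data are assumptions for illustration.

```python
import numpy as np

def logistic_risk(M, X, Y):
    """r^l(M): average of log(1 + exp(-Y_ij (XM)_ij)), computed stably."""
    margins = Y * (X @ M)
    return float(np.mean(np.logaddexp(0.0, -margins)))

def logistic_neg_log_likelihood(M, X, Y):
    """Negative log-likelihood when P(Y_ij = 1 | X) = f((XM)_ij), f(x) = e^x / (1 + e^x)."""
    prob_one = 1.0 / (1.0 + np.exp(-(X @ M)))
    lik = np.where(Y == 1, prob_one, 1.0 - prob_one)
    return float(-np.sum(np.log(lik)))

# Toy data (illustrative only).
rng = np.random.default_rng(4)
n, p, q = 20, 3, 4
X = rng.standard_normal((n, p))
M = rng.standard_normal((p, q))
Y = np.where(rng.uniform(size=(n, q)) < 0.5, 1.0, -1.0)

print(n * q * logistic_risk(M, X, Y), logistic_neg_log_likelihood(M, X, Y))
```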

As studied in Zhang (2004), the logistic loss can serve as a convex alternative to the hinge loss for approximating the 0-1 loss. However, it is worth noting that employing the logistic loss may lead to a slower convergence rate compared to the hinge loss (Zhang 2004).

In this study, we evaluate the performance of our proposed methods, LMC-H and MALA-H, in comparison to three alternatives: (1) LMC-logit, (2) MALA-logit (methods based on Bayesian logistic regression) and (3) the current state-of-the-art method mRRR. The mRRR method, which was proposed in Luo et al. (2018), is a frequentist approach and its implementation can be found in the R package rrpack (Chen et al. 2022).

Table 1 Summary of simulation settings
Table 2 Misclassification error

3.2 Simulation studies

We consider different scenarios of data generation to assess the performance of our method. The sample size is fixed as \( n=100 \), while we vary the dimension of the covariates and responses as \( q=8,p=12 \) and \( q=20,p=50 \). The entries of the covariate matrix X are simulated from \( {\mathcal {N}} (0,1) \). More specifically, we consider two scenarios for the true parameter matrix \( M^* \):

  • First, it is a rank-2 matrix that is a product of two rank-2 matrices, i.e., \( M^* = A_{p\times 2}B_{q\times 2}^\top \), where the entries of A and B are drawn i.i.d. from \( {\mathcal {N}} (0,1) \);

  • Second, it is an approximate rank-2 matrix. To create an approximate rank-2 matrix, we first simulate a rank-2 matrix \( M' \) as before and then add some noise as \( M^* = 2M' + N \), where entries of N are simulated from \( {\mathcal {N}} (0,0.1) \).

Then we consider the following settings to obtain the responses:

  • Setting I

    $$\begin{aligned} Y = \textrm{sign}( XM^* + E ) B. \end{aligned}$$
  • Setting II With \( u = XM^* + E, \) put \( p= \exp (u)/ (1 +\exp (u) ) \):

    $$\begin{aligned} Y_{ij} \sim \text {Binomial}(p_{ij}). \end{aligned}$$

Here, the noise terms \(E\) and \(B\) are varied across scenarios, which leads to a different setup within each setting. These setups are summarized in Table 1.
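The data-generating mechanisms above can be sketched in Python as follows. Several choices here are assumptions, since the exact noise specifications are given in Table 1: the noise E is taken to be Gaussian with a user-chosen standard deviation, the factor B of Setting I is omitted (equivalently set to 1), \( {\mathcal {N}}(0,0.1) \) is read as having variance 0.1, and the draws of Setting II are mapped to \( \{-1,1\} \). The function names are hypothetical.

```python
import numpy as np

def simulate_M(p, q, approximate=False, noise_var=0.1, rng=None):
    """Exact rank-2 matrix, or an approximately rank-2 matrix 2 M' + N."""
    rng = np.random.default_rng() if rng is None else rng
    left = rng.standard_normal((p, 2))
    right = rng.standard_normal((q, 2))
    M = left @ right.T
    if approximate:
        # Assumption: N(0, 0.1) is read as variance 0.1.
        M = 2.0 * M + np.sqrt(noise_var) * rng.standard_normal((p, q))
    return M

def simulate_Y(X, M, setting, noise_sd=1.0, rng=None):
    """Responses in {-1, 1} under Setting 'I' (sign model) or 'II' (logistic model)."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: Gaussian noise E; the factor B of Setting I is omitted here.
    u = X @ M + noise_sd * rng.standard_normal((X.shape[0], M.shape[1]))
    if setting == "I":
        return np.where(u >= 0, 1.0, -1.0)
    prob = 1.0 / (1.0 + np.exp(-u))
    return np.where(rng.uniform(size=u.shape) < prob, 1.0, -1.0)

rng = np.random.default_rng(5)
n, p, q = 100, 12, 8
X = rng.standard_normal((n, p))
M_exact = simulate_M(p, q, rng=rng)
M_approx = simulate_M(p, q, approximate=True, rng=rng)
Y_I = simulate_Y(X, M_exact, "I", rng=rng)
Y_II = simulate_Y(X, M_approx, "II", rng=rng)
print(np.linalg.matrix_rank(M_exact), Y_I.shape, Y_II.mean())
```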

LMC and MALA are run for 30,000 iterations, with the first 1000 steps discarded as burn-in. We set the tuning parameters \(\lambda \) and \(\tau \) to 1 in all scenarios. A better approach would be to tune these parameters by cross-validation, which could lead to improved results. The mRRR method is run with default options, and 5-fold cross-validation is used to select the rank.

Each simulation setting is repeated 100 times and we report the averaged results. The results of the simulation study are detailed in Tables 2, 3, 4 and 5; the values within parentheses are the standard deviations associated with each misclassification error percentage.

Table 3 Misclassification error
Table 4 Misclassification error
Table 5 Misclassification error

Among the methods compared, the MALA algorithm with the hinge loss consistently achieves the lowest misclassification error rate. In some instances, it even outperforms the frequentist mRRR method by a factor of two. This highlights the effectiveness of the proposed method, which combines the MALA algorithm and the hinge loss, for achieving accurate classification results.

The other methods, namely MALA-logit, which employs the logistic loss, and the LMC-based approaches (LMC-H and LMC-logit), exhibit performance similar to that of the mRRR method, with generally comparable misclassification error rates. However, it is worth noting that the MALA-logit approach, which also utilizes the MALA algorithm, is superior to the mRRR method in cases involving low-rank matrices and higher dimensions.

While the LMC-based methods (LMC-H and LMC-logit) perform well, they are not significantly better than the mRRR method. However, these approaches are noted to be more efficient in handling larger data sets, which can be advantageous in certain scenarios.

It’s important to acknowledge that as the proportion of missing data increases to 10% and 30%, the misclassification error percentages also increase. This suggests a decline in the performance of these methods in the presence of missing data. Therefore, addressing and mitigating the effects of missing data is crucial for improving the performance of these methods in real-world applications.

In summary, the MALA algorithm with the hinge loss stands out as the method with consistently lower misclassification error rates. The logistic loss-based methods and the LMC-based methods demonstrate comparable performance to the mRRR method. However, it is necessary to carefully consider the impact of missing data on these methods’ performance and employ appropriate strategies to handle missing data effectively.

3.3 A real data study

In this section, we evaluate the performance of our proposed method on a real data set with multiple binary responses. To do this, we utilize the spider data set, which can be found in the R package mvabund (Wang et al. 2012). The matrix of covariates, \( X \in {\mathbb {R}}^{28\times 6} \), includes information on 6 environmental features from 28 samples. The response matrix, \( Y \in {\mathbb {R}}^{28\times 12} \), consists of counts recording the abundance of 12 hunting spider species in the 28 samples. We convert the response data into binary format by setting \( y_{ij} = -1 \) if species j is absent from sample i and \( y_{ij} = 1 \) otherwise. This results in a total of 154 negative ones (45.8%) and 182 positive ones (54.2%).

The data is divided randomly into two sets: a training set consisting of 23 samples and a test set consisting of 5 samples. We use the training data to run the methods and then evaluate their prediction accuracy based on the test data. This process is repeated 100 times, each time with a different random partition of the training and test data. The results of this procedure are illustrated in Fig. 1. By repeating the procedure multiple times and averaging the results, we can obtain a more robust and accurate assessment of the performance of the methods. This approach allows us to account for any potential variability in the data and obtain a better understanding of the methods’ performance.
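The binarization and the repeated random-split evaluation can be sketched in Python as follows. The spider data live in an R package, so the sketch runs on synthetic counts of the same dimensions; the data, the function names, and the least-squares fit used as a stand-in estimator are all placeholders introduced for illustration and are not the methods compared in this paper.

```python
import numpy as np

def binarize_counts(counts):
    """Map counts to {-1, 1}: -1 when the species is absent, 1 otherwise."""
    return np.where(counts > 0, 1.0, -1.0)

def repeated_split_error(X, Y, fit, n_test=5, n_repeats=100, seed=0):
    """Mean and standard deviation of the test misclassification error over random splits."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    errors = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        M_hat = fit(X[train], Y[train])
        errors.append(np.mean(Y[test] * (X[test] @ M_hat) < 0))
    return float(np.mean(errors)), float(np.std(errors))

def least_squares_fit(X, Y):
    """Placeholder estimator (not one of the compared methods): least squares on the labels."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

# Synthetic stand-in with the dimensions of the spider data (28 samples, 6 covariates, 12 species).
rng = np.random.default_rng(6)
X = rng.standard_normal((28, 6))
counts = rng.poisson(1.0, size=(28, 12))
Y = binarize_counts(counts)
mean_err, sd_err = repeated_split_error(X, Y, least_squares_fit)
print(mean_err, sd_err)
```

Any estimator returning a \( p \times q \) matrix, such as the mean of the MALA-H iterates, can be passed in place of the placeholder `least_squares_fit`.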

Fig. 1 Results on real data

Fig. 2 Results on real data with missing data in the response matrix. Left: 10% of the entries are removed; middle: 20% of the entries are removed; right: 30% of the entries are removed

The results in Fig. 1 show that our proposed method, computed using the Metropolis-adjusted Langevin algorithm with the hinge loss (‘MALA-H’), outperforms the frequentist mRRR method. The prediction errors for MALA-H and mRRR are 15.05% (\( \pm \)5.15%) and 24.96% (\( \pm \)7.85%), respectively. The other approaches, such as those using LMC, also perform well and are slightly better than the mRRR method. MALA-logit, which uses the logistic loss, is slightly behind MALA-H with a prediction error of 16.47% (\( \pm \)5.63%). MALA-H is thus the most accurate method among those tested on this data set.

To further evaluate all the considered methods, we also investigate their performance when data are missing in the response matrix of the spider data set. To do this, we randomly remove 10%, 20%, and 30% of the entries in the response matrix. We denote by \( \Omega \) the index set of the observed entries. We then run all the considered methods on the data set with incomplete responses and evaluate the prediction (misclassification) error on the set of unobserved entries. This process is repeated 100 times. This allows us to assess how well the methods predict when part of the response matrix is unavailable. The results of this evaluation are presented in Fig. 2. In general, the outcomes with incomplete responses are comparable to those observed with complete responses, which suggests that all the methods under consideration are capable of handling missing data. Furthermore, MALA-H continues to outperform the other methods with incomplete responses.
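The masking and evaluation steps for the incomplete-response experiment can be sketched as follows. This is an illustration on synthetic data; the helper names `random_mask` and `error_on_unobserved` are hypothetical, and the boolean mask plays the role of the index set \( \Omega \) (True marks an observed entry).

```python
import numpy as np

def random_mask(shape, missing_frac, rng):
    """Boolean mask of the given shape; True = observed entry (the set Omega)."""
    n_total = shape[0] * shape[1]
    n_missing = int(round(missing_frac * n_total))
    flat = np.ones(n_total, dtype=bool)
    # Remove n_missing distinct entries uniformly at random.
    flat[rng.choice(n_total, size=n_missing, replace=False)] = False
    return flat.reshape(shape)

def error_on_unobserved(Y, Y_hat, observed):
    """Misclassification rate computed only on the entries outside Omega."""
    held_out = ~observed
    return np.mean(Y_hat[held_out] != Y[held_out])

# Synthetic binary responses with the same shape as the spider response matrix.
rng = np.random.default_rng(0)
Y = np.where(rng.normal(size=(28, 12)) > 0, 1, -1)
observed = random_mask(Y.shape, 0.10, rng)   # 10% of the entries removed

# Sanity checks with trivial predictions: a perfect one and a fully wrong one.
err_perfect = error_on_unobserved(Y, Y, observed)
err_flipped = error_on_unobserved(Y, -Y, observed)
```

In the experiment, a method would be fitted using only the entries where `observed` is True, and its predictions would replace the trivial ones used here.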

4 Discussion and conclusion

In this paper, we investigated the problem of making predictions for multivariate binary responses using a set of covariates by exploring low-rank predictors through the lens of machine learning. We focused on providing the prediction error rate, which has not been addressed in previously published works. Our approach leverages methods from statistical learning theory for binary classification and does not require any assumptions about the underlying observations. Instead, we focus on a set of predictors and aim to find the one that results in the lowest prediction error. We proposed a pseudo-Bayesian method, which is also able to handle incomplete response data.

Furthermore, we developed an efficient computational approximation method, based on a gradient-based sampling technique known as Langevin Monte Carlo. By implementing this method, we were able to overcome the computational challenges that are often associated with this type of probabilistic approach, making our proposed approach more practical and applicable in real-world settings. The numerical studies we conducted on simulated and real-world data sets have shown promising results when compared to the state-of-the-art method, further validating the effectiveness of our proposed approach.

Although our work offers a promising solution to the problem of relating a set of covariates to multiple binary responses, there is still room for further exploration and improvement. One direction for future research is to extend our method to incorporate variable selection, in order to identify the most important covariates in determining the binary responses. Future research could also address the problem of missing data in the covariate matrix X, which is a common issue in many real-world data sets. This would further improve the robustness and applicability of our proposed method.

An additional aspect that requires further attention in practical applications is the tuning of the learning rate \( \lambda \) and the parameter \( \tau \) in the prior distribution. While we have presented certain values that yield favorable theoretical outcomes, these choices may not be optimal in practice, although they offer some guidance regarding the magnitude of the tuning parameters. We have also mentioned that cross-validation can be employed in practical scenarios, albeit at the cost of increased computational time. The optimal tuning of these parameters remains a challenging problem in practical settings. In particular, tuning the learning rate \( \lambda \) remains an open research question that has garnered significant attention within the framework of generalized Bayesian inference; see, for example, Meunier and Alquier (2021), Wu and Martin (2023) and references therein.

Furthermore, when dealing with practical problems involving huge datasets, it has been observed that LMC algorithms may encounter scalability issues. In order to address this challenge, variational inference (VI) has emerged as a computational optimization-based alternative to Markov chain Monte Carlo techniques. VI has gained popularity for approximating intractable posterior distributions in large-scale Bayesian models due to its comparable effectiveness and superior computational efficiency. The connection between the PAC-Bayesian approach and variational inference has been elucidated in Alquier et al. (2016), while the development of variational inference for matrix completion has been explored in Cottet and Alquier (2018). These references provide a roadmap for future research, aiming to develop a more scalable computational approximation method for our proposed approach.