1 Introduction

Statistical methods for novelty detection are becoming increasingly popular in recent literature. Similar to standard supervised classifiers, these models are trained on a fully labeled dataset and subsequently employed to group unlabeled data. In addition, novelty detectors allow for the discovery and consequent classification of samples showcasing patterns not previously observed in the learning phase. For example, novelty detection models have been successfully employed in uncovering unknown adulterants in food authenticity studies (Cappozzo et al. 2020; Fop et al. 2022), collective anomalies in high energy particle physics (Vatanen et al. 2012), and novel communities in social network analyses (Bouveyron 2014).

Formally, we define a novelty detector as a classifier trained on a set (the training set) characterized by a specific number of classes that is used to predict the labels of a second set (the test set). For a detailed account of the topic, the interested reader is referred to the reviews of Markou and Singh (2003a) and Markou and Singh (2003b), where the former is devoted explicitly to statistical methods for novelty detection.

Within the statistical approach, different methodologies have been proposed to construct novelty detectors. Recently, Denti et al. (2021) introduced a Bayesian Robust Adaptive Novelty Detector (Brand). Brand is a semiparametric classifier divided into two stages, focused on training and test datasets, respectively. In the first stage, a robust estimator is applied to each class of the labeled training set. By doing so, Brand recovers the representative traits of each known class. The second step consists in fitting a semiparametric nested mixture to the test set: the hierarchical structure of the model specifies a convex combination between terms already observed in the training set and a novelty term, where the latter is decomposed into a potentially unbounded number of novel classes. At this point, Brand uses the information extracted from the first phase to elicit reliable priors for the known components.

In their article, Denti et al. (2021) devised an MCMC algorithm for posterior estimation. The algorithm is based on a variation of the slice sampler (Kalli et al. 2011) for nonparametric mixtures, which avoids the ex-ante specification of the number of previously unseen components, reflecting the expected ignorance about the structure of the novelty term. As a result, the algorithm achieves good performance, as it targets the true posterior without resorting to any truncation. Nonetheless, as is often the case when full MCMC inference is performed, it becomes excessively slow when applied to large multidimensional datasets.

In this paper, we aim to scale up the applicability of Brand by adopting a variational inference approach, vastly improving its computational efficiency. Variational inference is an estimation technique that approximates a complex probability distribution by resorting to optimization (Jordan et al. 1999; Ormerod and Wand 2010), and it has received significant attention within the statistical literature in the past decade (Blei et al. 2017). In the Bayesian set-up, the distribution of interest is, of course, the posterior distribution. Variational inference applied to Bayesian problems (also known as variational Bayes, VB from now on) induces a paradigm shift in the approximation of a complicated posterior: we switch from a simulation problem (MCMC) to an optimization one. Broadly speaking, the VB approach starts with the specification of a class of simple distributions called the variational family. Then, within this specified family, one looks for the member that minimizes a suitable distributional divergence (e.g., the Kullback–Leibler divergence) from the actual posterior. By dealing with a minimization problem instead of a simulation one, we can considerably scale up the applicability of Brand, obtaining results for datasets with thousands of high-dimensional observations in a fraction of the time needed by MCMC techniques. Variational techniques have showcased the potential to extend the reach of the Bayesian framework to large datasets, sparking interest in their theoretical properties (e.g., Wang and Titterington 2012; Nieman et al. 2022; Ray and Szabó 2022) and their applicability to, for instance, logistic regression (Jaakkola and Jordan 2000; Rigon 2023), network analysis (Aliverti and Russo 2022), and, more generally, non-conjugate (Wang 2012) and advanced models (Zhang et al. 2019).

The present paper is structured as follows. In Sect. 2, we review Brand and provide a summary of the variational Bayes approach. In Sect. 3, we discuss the algorithm and the hyperparameters needed for the VB version of the considered model. Section 4 reports extensive simulation studies that examine the efficiency and robustness of our method. Then, in Sect. 5, we present an application to the Statlog dataset, openly available from the UCI Machine Learning Repository. In this context, novelty detection is used to discover novel types of soil starting from satellite images. We remark that this analysis would have been prohibitive with a classical MCMC approach. Lastly, Sect. 6 concludes the manuscript.

2 Background: Bayesian novelty detection and variational Bayes

This section introduces the two stages that compose Brand and briefly presents the core concepts of the variational Bayes approach.

2.1 The Brand model

In what follows, we review the two-stage procedure proposed in Denti et al. (2021). The first stage centers around a fully labeled learning set, from which we extract robust information to set up a Bayesian semiparametric model in the second stage. More specifically, consider a labeled training dataset with n observations grouped into J classes. This paper will focus on classes distributed as multivariate Gaussians, but one can readily extend the model using different distributions. In this regard, we will write \(\mathcal {N}\left( \cdot \mid \varvec{\Theta }\right)\) to indicate a multivariate Gaussian density with generic location and scale parameters \(\varvec{\Theta }=\left( \varvec{\mu },\varvec{\Sigma } \right)\).

In the first stage of the Brand model, within each of the J training classes, we separately apply the Minimum Regularized Covariance Determinant (MRCD) estimator (Boudt et al. 2020) to retrieve robust estimates for the associated mean vector and covariance matrix. In detail, the MRCD searches for a subset of observations with fixed size h (with \(n/2\le h<n\)) whose regularized sample covariance has the lowest possible determinant. MRCD estimates are then defined as the multivariate location and regularized scatter based on the so-identified subset. Compared to the popular Minimum Covariance Determinant (MCD, Rousseeuw and Driessen 1999; Hubert et al. 2018), the MRCD takes a convex combination of the h-subset sample covariance matrix and a pre-specified, well-conditioned positive definite target matrix. While the former has a zero determinant when \(p>h\), which would result in an ill-defined MCD solution, the regularization introduced in the MRCD ensures the existence of the estimator even when the data dimension exceeds the sample size. This feature implies that the robustness properties of the original procedure are preserved whilst being applicable also to “\(p>n\)” problems. Such a characteristic is paramount in the context considered in the present paper, where we specifically aim to scale up the applicability of Brand to high-dimensional scenarios. We will denote the estimates obtained for class j, \(j=1,\ldots ,J,\) with \(\bar{\varvec{\Theta }}^{MRCD}_j = \left( \hat{\varvec{\mu }}^{MRCD}_j,\hat{\varvec{\Sigma }}^{MRCD}_j\right)\). These quantities will be employed to elicit informative priors for the J known components in the second stage. In so doing, outliers and label noise that might be present in the training set will not hamper the specification of the Bayesian mixture model reported hereafter.
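In practice, this first stage can be carried out with off-the-shelf software. The following is a minimal sketch for a single class, assuming the CovMrcd() interface of the rrcov R package; the matrix X_train_j (holding the training observations labeled as class j) and the subset-size fraction are illustrative choices.

```r
## First-stage sketch for a single training class (a minimal example, assuming
## the CovMrcd() interface of the rrcov package; X_train_j is a hypothetical
## n_j x p matrix with the observations labeled as class j)
library(rrcov)
fit_j <- CovMrcd(X_train_j, alpha = 0.75)   # h is roughly 0.75 * n_j
mu_hat_j    <- getCenter(fit_j)             # robust location estimate
Sigma_hat_j <- getCov(fit_j)                # regularized robust scatter estimate
```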

In the second stage, we estimate a semiparametric Bayesian classifier on a test set with M unlabeled observations. We want to build a novelty detector, i.e., a model that can discern between “known” units, which follow a pattern already observed in the training set, and “novelties”. At this point, the likelihood for each observation is a simple two-group mixture between a generic novelty component \(f_{nov}\) and a density reflecting the behavior found in the training set, \(f_{obs}\). Given the first stage, it is natural to specify \(f_{obs}\) as a mixture of J multivariate Gaussians, whose hyperprior elicitation will be guided by the robust estimates \(\varvec{\Theta}^{obs}_j=\bar{\varvec{\Theta }}^{MRCD}_j\). Therefore, we can now write the likelihood for the generic test observation \(\varvec{y}_m\), where \(m=1,\ldots , M\), as a mixture of \(J+1\) distributions:

$$\begin{aligned} \mathcal {L}\left( \varvec{y}_m \mid \varvec{\Theta }^{obs}, \varvec{\pi }\right) = \pi _0 f_{nov}(\varvec{y}_m) + \sum _{j=1}^{J} \pi _j\mathcal {N}\left( \varvec{y}_m\mid \varvec{\Theta }^{obs}_{j}\right) . \end{aligned}$$
(1)

Fitting this model to the test set allows us to either allocate each of the M observations into one of the previously observed J Gaussian classes or flag it as a novelty generated from an unknown distribution \(f_{nov}\).

The specification of \(f_{nov}\) is more delicate. To specify a flexible distribution that would reflect our ignorance regarding the novelty term, we employ a Dirichlet process mixture model with Gaussian kernels (Escobar and West 1995). It is a mixture model where the mixing distribution is sampled from a Dirichlet process (Ferguson 1973), characterized by concentration parameter \(\gamma\) and base measure H. In other words, we have that \(f_{nov}(\varvec{y}_m) = \sum _{k=1}^\infty \omega _k \mathcal {N}\left( \varvec{y}_m \mid \varvec{\Theta }^{nov}_{k}\right)\), where the atoms are sampled from the base measure, i.e., \(\varvec{\Theta }^{nov}_k\sim H\), and the mixture weights follow the so-called stick-breaking construction (Sethuraman 1994), where \(\omega _k = v_k\prod _{l<k}(1-v_l)\) and \(v_l\sim Beta(1,\gamma )\). To indicate this process, we will write \(\varvec{\omega }\sim SB(\gamma )\). Thus, the likelihood in (1) becomes, for \(m=1,\ldots ,M\),

$$\begin{aligned} \mathcal {L}\left( \varvec{y}_m \mid \varvec{\Theta }, \varvec{\pi }\right) = \pi _0 \left[ \sum _{k=1}^\infty \omega _k \mathcal {N}\left( \varvec{y}_m \mid \varvec{\Theta }^{nov}_{k}\right) \right] + \sum _{j=1}^{J} \pi _j\mathcal {N}\left( \varvec{y}_m \mid \varvec{\Theta }_{j}^{obs}\right) . \end{aligned}$$

This nested-mixture expression of the likelihood highlights a two-fold advantage. First, it is highly flexible, effectively capturing departures from the known patterns and flagging them as novelties. Second, the mixture nature of \(f_{nov}\) allows us to automatically cluster the novelties, capturing potential patterns that may arise. Furthermore, clusters in the novelty term characterized by very small sizes can be interpreted as simple outliers.
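To make the stick-breaking construction used for \(f_{nov}\) concrete, the following minimal R sketch simulates a truncated set of weights; the truncation level and the seed are illustrative choices, not part of the model.

```r
## Minimal sketch of a truncated stick-breaking draw omega ~ SB(gamma)
set.seed(1)
gamma_dp <- 1
T_trunc  <- 20
v <- rbeta(T_trunc, 1, gamma_dp)
v[T_trunc] <- 1                                # close the stick at the truncation level
omega <- v * cumprod(c(1, 1 - v[-T_trunc]))    # omega_k = v_k * prod_{l<k} (1 - v_l)
sum(omega)                                     # equals 1 by construction
```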

Notice that the previous nested-mixture likelihood can be conveniently re-expressed as:

$$\begin{aligned} \mathcal {L}\left( \varvec{y}_m \mid \varvec{\Theta }, \varvec{\pi }\right)&= \sum _{k=1}^{\infty } \tilde{\pi }_k \mathcal {N}\left( \varvec{y}_{m}\mid \tilde{\varvec{\Theta }}_{k}\right) , \end{aligned}$$
(2)

where, for \(k \in \mathbb {N}\), we define \(\tilde{\pi }_k = {\pi _k}^{{\mathbbm{1}}{\{0 < k \le J\}}}(\pi _0\omega _{k-J})^{{\mathbbm{1}}{\{k > J\}}}\) and \({\tilde{\varvec{\Theta }}_k = (\varvec{\Theta}_k^{obs})^{{\mathbbm{1}}{\{0 < k \le J\}}}(\varvec{\Theta}_{k}^{nov})^{{\mathbbm{1}}{\{k > J\}}}}\). Note that, without loss of generality, we regard the first J mixture components as the known components and all the remaining ones as novelties.
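A minimal R sketch of this relabelling is reported below; the function and the example weights are purely illustrative.

```r
## Sketch of the relabelled weights in (2): the first J entries are the known-class
## weights, the remaining ones spread pi_0 over the (truncated) novelty weights;
## pi_vec = (pi_0, pi_1, ..., pi_J) and omega are illustrative inputs
relabel_weights <- function(pi_vec, omega) {
  c(pi_vec[-1], pi_vec[1] * omega)
}
# e.g., relabel_weights(c(0.3, 0.4, 0.2, 0.1), omega) sums to one whenever omega does
```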

Finally, as customary in mixture models, to ease the computation, we augment the model by introducing the auxiliary variables \(\xi _m\in \mathbb {N}\), for \(m=1, \ldots ,M\), where \(\xi _m = l\) means that the m-th observation has been assigned to the l-th component. Therefore, the model in (2) simplifies to

$$\begin{aligned} \varvec{y}_m \mid \xi _m,\tilde{\varvec{\Theta }}, \varvec{\pi }\sim&\mathcal {N}\left( \varvec{y}_m\mid \tilde{\varvec{\Theta }}_{\xi _m}\right) ,\quad \quad {\xi _m}\mid {\tilde{\varvec{\pi }} \overset{iid}{\sim } \sum _{k=1}^\infty {\tilde{\pi }_k\delta _k(\cdot )} }. \end{aligned}$$
(3)

We complete our Bayesian model with the following prior specifications for the weights and the atoms:

$$\begin{aligned} \begin{aligned} \varvec{\pi }&\sim {Dirichlet}(\alpha _0, \alpha _1,..., \alpha _J),\quad \quad \varvec{\omega } \sim {SB}(\gamma ),\\ \varvec{\Theta }_k^{obs}&\sim {\mathcal {NIW}}(\varvec{\mu }^{{obs}}_k, \nu ^{{obs}}_k, \lambda ^{{obs}}_k, \varvec{\Psi }^{{obs}}_k), \quad \quad k = 1,\ldots ,J,\\ \varvec{\Theta }_k^{nov}&\sim {\mathcal {NIW}}({\varvec{\mu }_0^ {nov}}, \nu _0^{ {nov}}, \lambda _0^{ {nov}}, \varvec{\Psi }_0^{nov}), \quad \quad k = J+1,\ldots , \infty , \end{aligned} \end{aligned}$$
(4)

where \(\mathcal {NIW}\) indicates a Normal Inverse-Wishart distribution. To ease the notation, let \(\varvec{\Theta } = \{\varvec{\Theta }_k^{obs}\}_{k=1}^J \cup \{\varvec{\Theta }_k^{nov}\}_{k=J+1}^\infty\), \(\varvec{\varrho }_k = (\varvec{\mu }^{{obs}}_k, \nu ^{{obs}}_k, \lambda ^{{obs}}_k, \varvec{\Psi }^{{obs}}_k)\) and \(\varvec{\varrho }_0 = (\varvec{\mu }^{nov}_0, \nu ^{nov}_0, \lambda ^{nov}_0, \varvec{\Psi }^{nov}_0)\). We remark that the values for the hyperparameters \(\{\varvec{\varrho }_k\}_{k = 1}^J\) are defined according to the robust estimates obtained in the first stage. See Denti et al. (2021) for more details about the hyperparameter specification.

Given the model, it is easy to derive the joint law of the data and the parameters, which is proportional to the posterior distribution of interest \(p\left( {\varvec{\Theta }}, \varvec{\pi },\varvec{\xi } \mid \varvec{y} \right)\). Therefore, the posterior we target for inference satisfies:

$$\begin{aligned} p\left( {\varvec{\Theta }}, \varvec{\pi },\varvec{\xi } \mid \varvec{y} \right) \propto&\prod _{m=1}^M\left[ \prod _{k=1}^J\mathcal {N}(\varvec{y}_m\mid \varvec{\Theta }_k^{obs})^{\mathbbm {1}_{\xi _m=k}}\prod _{k=J+1}^{\infty }\mathcal {N} (\varvec{y}_m\mid \varvec{\Theta }_k^{nov})^{\mathbbm {1}_{\xi _m=k}}\right] \nonumber \\&\times \prod _{k=1}^J{\mathcal {NIW}}(\varvec{\Theta }_k^{obs}\mid \varvec{\varrho }_k) \prod _{k=J+1}^{\infty }{\mathcal {NIW}}(\varvec{\Theta }_k^{nov}\mid \varvec{\varrho }_0) \nonumber \\&\times \prod _{m=1}^M \left[ \prod _{k=1}^J(\pi _k)^{\mathbbm {1}_{\xi _m=k}}\prod _{k=J+1}^{\infty }\left( \pi _0\, v_{k-J}\prod _{h=1}^{k-J-1}(1-v_h)\right) ^{\mathbbm {1}_{\xi _m=k}}\right] \nonumber \\&\times \prod _{k=0}^J(\pi _k)^{\alpha _k-1}\prod _{k=1}^{\infty }(1-v_k)^{\gamma -1}. \end{aligned}$$
(5)

The left panel of Fig. 1 contains a diagram that summarizes how Brand works. In the following, we will devise a VB approach to approximate (5) in a timely and efficient manner. The following subsection briefly outlines the general strategy underlying a mean-field variational approach, while the thorough derivation of the algorithm used to estimate Brand is deferred to Sect. 3.

Fig. 1

Diagrams depicting how Brand (left panel) and variational inference (right panel) work, respectively

2.2 A short summary of mean-field variational Bayes

Working in a Bayesian setting, we ultimately seek to estimate the posterior distribution to draw inference. Unfortunately, the expression in (5) is not available in closed form, and therefore we need to rely on approximations. MCMC algorithms, which simulate draws from (5), can be prohibitively slow when applied to large datasets. To this end, we resort to variational inference, which recasts the approximation of (5) as an optimization problem. For notational simplicity, while we present the basic ideas of VB, let us denote a generic target distribution with \(p(\varvec{\theta }\mid \varvec{y})\propto p(\varvec{\theta }, \varvec{y})\).

As in Blei et al. (2017), we focus on a mean-field variational family \(\mathcal {Q}\), i.e., a set of distributions under which the components of the parameter vector \(\varvec{\theta }\) are mutually independent: \(\mathcal {Q} = \{q_{\varvec{\zeta }}(\varvec{\theta }): q_{\varvec{\zeta }}(\varvec{\theta }) = \prod _jq_{\zeta _j} ({\theta }_j)\}\). The postulated independence dramatically simplifies the problem at hand. Notice that each member of \(\mathcal {Q}\) depends on a set of variational parameters denoted by \(\varvec{\zeta }\).

We seek, among the members of this family, the candidate that provides the best approximation of our posterior distribution \(p(\varvec{\theta }\mid \varvec{y})\). Herein, the goodness of the approximation is quantified by the Kullback–Leibler (KL) divergence. Thus, we aim to find the member of the variational family \(\mathcal {Q}\) that minimizes the KL divergence between the variational approximation and the actual posterior distribution. The KL divergence \({D}_{KL} (\cdot \mid \mid \cdot )\) can be written as \({D}_{KL} (q_{\varvec{\zeta }}(\varvec{\theta })\mid \mid p(\varvec{\theta } \mid \varvec{y})) = \mathbb {E}_q[\log {q_{\varvec{\zeta }}(\varvec{\theta })}] - \mathbb {E}_q[\log {p(\varvec{\theta },\varvec{y})}]+\log p(\varvec{y}).\) Unfortunately, we cannot easily compute the evidence \(p(\varvec{y})\). However, the evidence does not depend on any variational parameter and can be treated as a constant during the optimization process. We then re-formulate the problem into an equivalent one, the maximization of the Evidence Lower Bound (ELBO), which is fully computable:

$$\begin{aligned} ELBO(q) = \mathbb {E}_q[\log {p(\varvec{\theta},\varvec{y})}] - \mathbb {E}_q[\log {q_{\varvec{\zeta }}(\varvec{\theta})}]. \end{aligned}$$
(6)

Since \(\log p(\varvec{y}) = ELBO(q) + {D}_{KL} (q_{\varvec{\zeta }}(\varvec{\theta })\mid \mid p(\varvec{\theta }\mid \varvec{y}))\) and the evidence is fixed, maximizing (6) is equivalent to minimizing the aforementioned KL divergence.

To detect the optimal member \(q^\star \in \mathcal {Q}\), we employ a widely used algorithm called Coordinate Ascent Variational Inference (CAVI). It consists of a one-variable-at-a-time optimization procedure. Indeed, exploiting the independence postulated by the mean-field property, one can show that the optimal variational distribution for the parameter \(\theta _j\) is given by:

$$\begin{aligned} q^\star _{\zeta _j}(\theta _j)\propto \exp \{\mathbb {E}_{-j}\left[ \log p(\theta _j, \varvec{\theta }_{-j},\varvec{y})\right] \}, \end{aligned}$$
(7)

where \(\varvec{\theta }_{-j} = \{\theta _l\}_{l\ne j}\) and the expected value is taken w.r.t. the variational densities of \(\varvec{\theta }_{-j}\). The CAVI algorithm iteratively computes (7) for every j until the ELBO no longer registers any significant improvement. The basic idea behind the CAVI algorithm is depicted in the right half of Fig. 1.
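For concreteness, a schematic (and purely illustrative) R skeleton of the CAVI loop is reported below; update_fns and compute_elbo are hypothetical placeholders for the coordinate updates in (7) and the bound in (6), and the tolerance mirrors the stopping rule adopted later in the paper.

```r
## Schematic CAVI loop (a generic sketch, not the VarBRAND implementation)
run_cavi <- function(zeta_init, update_fns, compute_elbo, tol = 1e-9, max_iter = 1000) {
  zeta <- zeta_init
  elbo_old <- -Inf
  for (it in seq_len(max_iter)) {
    for (update_j in update_fns) zeta <- update_j(zeta)  # one factor q_j at a time, as in (7)
    elbo_new <- compute_elbo(zeta)
    if (abs(elbo_new - elbo_old) < tol) break            # stop once the ELBO stabilises
    elbo_old <- elbo_new
  }
  list(zeta = zeta, elbo = elbo_new, iterations = it)
}
```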

In the next section, we derive the CAVI updates for the variational approximation of our model, along with the corresponding expression of the ELBO. We call the resulting algorithm variational Brand, or VBrand.

3 Variational Bayes for the Brand model

In this section, we tailor the generic variational inference algorithm to our specific case. First, we write our variational approximation, highlighting the dependence of each distribution on its specific variational parameters, collected in \(\varvec{\zeta } =\left( \varvec{\eta },\varvec{\varphi },\varvec{a},\varvec{b},\varvec{\rho }^{obs},\varvec{\rho }^{nov} \right)\). The factorized form we adopt reads as follows:

$$\begin{aligned} \begin{aligned} q_{\varvec{\zeta }}( \varvec{\xi }, \varvec{\pi }, \varvec{v}, \varvec{\Theta }^{obs}, \varvec{\Theta }^{nov}) =&\;q_{\varvec{\eta }}(\varvec{\pi }) \prod _{m=1}^M q_{\varvec{\varphi }^{(m)}} (\xi ^{(m)}) \prod _{k=1}^{T-1} q_{a_k,b_k}(v_k) \\&\times \prod _{k=1}^J q_{\varvec{\rho }_k^{obs}} (\varvec{\Theta }^{obs}_k) \prod _{k=J+1}^{J+T} q_{\varvec{\rho }_k^{nov}} (\varvec{\Theta }^{nov}_k). \end{aligned} \end{aligned}$$
(8)

In Eq. (8), we truncated the stick-breaking representation of the Dirichlet Process on the novelty term at a pre-specified threshold T, as suggested in Blei and Jordan (2006). This implies that \(q(v_T = 1) = 1\) and that all the variational mixture weights indexed by \(t > T\) are equal to zero.

Then, we can exploit a key property of VB. Note that all the full conditionals of the Brand model have closed-form expressions (cf. Section S1 of the Supplementary Material) and belong to the exponential family. This feature greatly simplifies the search for the variational solution. Indeed, it ensures that each optimal variational distribution belongs to the same family as the corresponding full conditional, with properly updated variational parameters. Therefore, we can already state that \(q_{\varvec{\eta }}(\varvec{\pi })\) is the density function of a Dirichlet distribution, \(q_{\varvec{\varphi }^{(m)}}(\xi ^{(m)})\) is a Categorical distribution, each \(q_{a_k,b_k}(v_k)\) is a Beta distribution, and \(q_{\varvec{\rho }_k^{obs}} (\varvec{\Theta }^{obs}_k)\) and \(q_{\varvec{\rho }_k^{nov}} (\varvec{\Theta }^{nov}_k)\) are both Normal Inverse-Wishart distributions.

3.1 The CAVI parameters update

Once the parametric expressions for the members of the variational family are obtained, we can derive the explicit formulas to optimize the parameters via the CAVI algorithm. In this subsection, we state the updating rules that have to be iteratively computed to fit VBrand. As we will observe, the set of responsibilities \(\{\varphi _k^{(m)}\}_{k = 1}^{J+T}\) for \(m=1,\ldots ,M\), i.e., the variational probabilities \(\varphi ^{(m)}_k=q(\xi ^{(m)}=k)\), will play a central role in all the steps. In detail, the CAVI steps are as follows:

  1.

    \(q^\star _{\varvec{\eta }}(\varvec{\pi })\) is the density of a \(Dirichlet(\eta _0,\eta _1,\ldots ,\eta _J)\), where

    $$\begin{aligned} \eta _0 = \alpha _0 + \sum _{m=1}^M\sum _{l=J+1}^{J+T}\varphi _l^{(m)}, \quad \; \eta _j = \alpha _j + \sum _{m=1}^M\varphi _j^{(m)}, \; \text { for }j=1,\ldots ,J. \end{aligned}$$
    (9)

    Here, each hyperparameter linked to a known component is updated with the sum of the responsibilities of the data belonging to the same specific known component. Likewise, the variational novelty probability hyperparameter \(\eta _0\) contains the sum of the responsibilities of all data belonging to all the novelty terms.

  2.

    Each \(q^\star _{a_k,b_k}(v_k)\), for \(k=1,\ldots ,T-1\), is the density of a \(Beta(a_k,b_k)\). The update for these variational parameters is given by:

    $$\begin{aligned} a_k = 1 + \sum _{m=1}^M\varphi _{J+k}^{(m)}, \quad b_k = \gamma + \sum _{l=k+1}^T\sum _{m=1}^M\varphi _{J+l}^{(m)}. \end{aligned}$$
    (10)

    Here, as expected from the stick-breaking construction, the first parameter of the variational Beta distribution is updated with the sum of the probabilities that each point belongs to the specific novelty cluster k. At the same time, the second parameter is updated with the sum of the variational probabilities of belonging to one of the subsequent novelty components.

  3.

    Both \(q^\star _{\varvec{\rho }_k^{nov}} (\varvec{\Theta }^{nov}_k)\) and \(q^\star _{\varvec{\rho }_k^{obs}} (\varvec{\Theta }^{obs}_k)\) are Normal Inverse-Wishart densities. Let us start with the updating rules of the known components. Note that each variational parameter \(\varvec{\rho }_k^{obs}\) is a shorthand for \(\left( \varvec{m}^{obs}_k,\ell ^{obs}_k, u^{obs}_k, \varvec{S}^{obs}_k \right)\). These parameters have the same interpretation as the parameters of (4), contained in \(\varvec{\varrho }_k\). So we have

    $$\begin{aligned} \varvec{m}^{obs}_k&= \frac{1}{\lambda ^{obs}_k+\sum _{m=1}^M\varphi _{k}^{(m)}}\left( \lambda ^{obs}_k\varvec{\mu }^{obs}_k + \sum _{m=1}^M\varvec{y}_m\varphi _{k}^{(m)}\right) , \\ \ell _k^{obs}&= \lambda _k^{obs} + \sum _{m=1}^M\varphi _{k}^{(m)},\quad u^{obs}_k = \nu _k^{obs} + \sum _{m=1}^M\varphi _{k}^{(m)},\\ \varvec{S}^{obs}_k&= \varvec{\Psi }^{obs}_k + \sum _{m=1}^M \hat{\varvec{\Sigma }}_{k}^{(m)} + \frac{\lambda ^{obs}_k\sum _{m=1}^M\varphi _{k}^{(m)}}{\lambda ^{obs}_k+\sum _{m=1}^M\varphi _{k}^{(m)}}(\overline{\varvec{y}}_k-\varvec{\mu }^{obs}_k)(\overline{\varvec{y}}_k-\varvec{\mu }^{obs}_k)^T, \end{aligned}$$
    (11)

    where we defined \(\hat{\varvec{\Sigma }}_{k}^{(m)} = (\varvec{y}_m-\overline{\varvec{y}}_k)(\varvec{y}_m-\overline{\varvec{y}}_k)^T\varphi _{k}^{(m)}\) and \(\overline{\varvec{y}}_k = \sum _{m=1}^M\varvec{y}_m\varphi _{k}^{(m)}/\sum _{m=1}^M\varphi _{k}^{(m)}.\) The update for the parameters in \(\varvec{\rho }_k^{nov}\) follows the same structure, with the hyperprior parameters in \(\varvec{\varrho }_0\) substituted for those in \(\varvec{\varrho }_k\).

  4.

    Updating the responsibilities \(\{\varphi _k^{(m)}\}_{k = 1}^{J+T}\) for \(m=1,\ldots ,M,\) is the most challenging step of the algorithm, given the nested nature of the mixture in (2). We recall that, for a given m, the distribution \(q_{\varvec{\varphi }^{(m)}}(\xi ^{(m)})\) is categorical with \(J+T\) levels. Thus, we need to compute the values for the \(J+T\) corresponding probabilities. For the known classes \(k=1,\ldots ,J\), we have

    $$\begin{aligned} \log \varphi _k^{(m)} \propto \mathbb {E}[\log {\pi _k}] + \mathbb {E}[\log {\mathcal {N}(\varvec{y}_m\mid \varvec{\Theta }_k^{obs})}], \end{aligned}$$
    (12)

    while for the novelty terms \(k=J+1,\ldots ,J+T\), we have

    $$\begin{aligned} \log \varphi _k^{(m)} &\propto \mathbb {E}[\log {\pi _0}] + \mathbb {E}[\log {v_{k-J}}] + \sum _{l=1}^{k-J-1}\mathbb {E}[\log \left( 1-v_{l}\right) ] \\ &\quad +\mathbb {E}[\log {\mathcal {N}(\varvec{y}_m\mid \varvec{\Theta }_{k}^{nov})}]. \end{aligned}$$
    (13)

    For the sake of conciseness, we report the explicit expressions for all the terms of (12) and (13) in Section S2 of the Supplementary Material. These updates state that the probability that datum \(\varvec{y}_m\) belongs to cluster k depends on the likelihood of \(\varvec{y}_m\) under that same cluster and on the overall relevance of the \(k^{th}\) cluster. Such relevance is given by the expected value of the corresponding component of the Dirichlet-distributed \(\varvec{\pi }\) and, for the novelties, of the stick-breaking weight, here unrolled into its Beta-distributed components. A schematic implementation of updates (9)–(13) is sketched below.
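The following R sketch gathers the above updates in two illustrative functions: the first implements (9), (10), and the responsibility updates (12)–(13), assuming that the expected Gaussian log-densities \(\mathbb {E}[\log \mathcal {N}(\varvec{y}_m\mid \varvec{\Theta }_k)]\) have already been evaluated (their explicit form is in Section S2); the second implements the \(\mathcal {NIW}\) update (11) for a single component. All object names are illustrative and do not reflect the VarBRAND package.

```r
## Phi:       M x (J + T) matrix of current responsibilities
## Elog_dens: M x (J + T) matrix of expected Gaussian log-densities (assumed given)
update_weights_and_resp <- function(Phi, Elog_dens, alpha, gamma_dp, J) {
  M <- nrow(Phi); T_trunc <- ncol(Phi) - J
  nov_mass <- colSums(Phi[, (J + 1):(J + T_trunc), drop = FALSE])

  ## (9): Dirichlet parameters (eta_0, eta_1, ..., eta_J)
  eta <- c(alpha[1] + sum(nov_mass), alpha[-1] + colSums(Phi[, 1:J, drop = FALSE]))

  ## (10): Beta parameters of the truncated stick-breaking weights
  a <- 1 + nov_mass[1:(T_trunc - 1)]
  b <- gamma_dp + rev(cumsum(rev(nov_mass)))[2:T_trunc]   # mass of the subsequent components

  ## (12)-(13): responsibilities on the log scale, then normalised row by row
  Elog_pi  <- digamma(eta) - digamma(sum(eta))            # E[log pi_0], ..., E[log pi_J]
  Elog_v   <- digamma(a) - digamma(a + b)                 # E[log v_k]
  Elog_1mv <- digamma(b) - digamma(a + b)                 # E[log(1 - v_k)]
  stick    <- c(Elog_v, 0) + cumsum(c(0, Elog_1mv))       # E[log v_k] + sum_{l<k} E[log(1-v_l)]
  log_phi  <- cbind(matrix(Elog_pi[-1], M, J, byrow = TRUE),
                    matrix(Elog_pi[1] + stick, M, T_trunc, byrow = TRUE)) + Elog_dens
  Phi_new  <- exp(log_phi - apply(log_phi, 1, max))       # numerically stabilised
  Phi_new  <- Phi_new / rowSums(Phi_new)

  list(eta = eta, a = a, b = b, Phi = Phi_new)
}

## (11): NIW update for one component; `prior` collects (mu, lambda, nu, Psi)
update_niw <- function(Y, phi_k, prior) {
  Nk   <- sum(phi_k)
  ybar <- colSums(Y * phi_k) / Nk                         # responsibility-weighted mean
  Yc   <- sweep(Y, 2, ybar)                               # centred observations
  Sk   <- crossprod(Yc * sqrt(phi_k))                     # sum_m phi * (y_m - ybar)(y_m - ybar)^T
  d    <- ybar - prior$mu
  list(m   = (prior$lambda * prior$mu + colSums(Y * phi_k)) / (prior$lambda + Nk),
       ell = prior$lambda + Nk,
       u   = prior$nu + Nk,
       S   = prior$Psi + Sk + (prior$lambda * Nk / (prior$lambda + Nk)) * outer(d, d))
}
```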

3.2 The expression of the ELBO for VBrand

In this subsection, we report the terms that need to be derived to obtain the ELBO in Eq. (6) for the VBrand model. We start by computing the first term of Eq. (6), which takes the following form:

$$\begin{aligned}\begin{aligned} \mathbb {E}[\log {p}] =&\sum _{m=1}^M\left( \sum _{k=1}^J f_{1}^{(m,k)} + \sum _{k=J+1}^{J+T} f_{2}^{(k)} \right) + \sum _{k=1}^J f_{3}^{(k)} + \sum _{k=J+1}^{J+T} f_{4}^{(k)} \\&+ \sum _{m=1}^M\left( \sum _{k=1}^J f_{5}^{(m,k)} + \sum _{k=J+1}^{J+T} f_{6}^{(m,k)} \right) + \sum _{k=0}^{J} f_{7}^{(k)} + \sum _{l=1}^{T} f_{8}^{(l)} + const, \end{aligned}\end{aligned}$$

where the quantities \(\{f_k\}_{k=1}^8\) have the following expressions (note we have suppressed the superscripts to ease the notation, and that \(\varvec{\psi }(\cdot )\) indicates the digamma function):

$$\begin{aligned} \begin{aligned} f_{1}&= \varphi _{k}^{(m)}\mathbb {E}[\log {\mathcal {N}(\varvec{y}_m\mid \varvec{\Theta }_k^{obs})}],&\quad f_{2}&= \varphi _{k}^{(m)}\mathbb {E}[\log {\mathcal {N}(\varvec{y}_m\mid \varvec{\Theta }_k^{nov})}], \\ f_{3}&= \mathbb {E}[\log {{\mathcal {NIW}}(\varvec{\Theta }_k^{obs}\mid \varvec{\varrho }_k)}],&\quad f_{4}&= \mathbb {E}[\log {{\mathcal {NIW}}(\varvec{\Theta }_k^{nov}\mid \varvec{\varrho }_0)}]. \end{aligned} \end{aligned}$$

Moreover, we have \(f_{5} = \varphi _{k}^{(m)}\left( \varvec{\psi }\left( \eta _k\right) -\varvec{\psi }\left( \sum _{j=0}^J\eta _j\right) \right)\),

$$\begin{aligned} \begin{aligned} f_{6} = \varphi _{k}^{(m)}\left[ \varvec{\psi }(\eta _0)-\varvec{\psi }\left( \sum _{j=0}^J\eta _j\right) + \varvec{\psi }(a_{k-J})-\varvec{\psi }(a_{k-J}+b_{k-J}) + \sum _{h=1}^{k-J-1}\left( \varvec{\psi }(b_{h})-\varvec{\psi }(a_{h}+b_{h})\right) \right] , \end{aligned} \end{aligned}$$

and, lastly, \(f_{7} = (\alpha _k-1)(\varvec{\psi }(\eta _k)-\varvec{\psi }(\sum _{j=0}^J\eta _j)),\) and \(f_{8} = (\gamma -1)(\varvec{\psi }(b_l)-\varvec{\psi }(a_l+b_l))\).

The second term of Eq. (6) can be written as

$$\begin{aligned} \mathbb {E}[\log {q}] = \sum _{m=1}^{M} h_1^{(m)} + h_2 + \sum _{k=1}^{T}h_3^{(k)} + \sum _{k=1}^{J}h_4^{(k)} + \sum _{k=1}^{T}h_5^{(k)}, \end{aligned}$$

with \(h_{1} = \sum _{k=1}^{J+T}\varphi _{k}^{(m)}\ln {\varphi _{k}^{(m)}} - \ln {\sum _{k=1}^{J+T}\varphi _{k}^{(m)}}\), \(h_{4} = \mathbb {E}[\log {\mathcal {NIW}(\varvec{\Theta }_k^{obs}\mid \varvec{\rho }_k^{obs})}]\),

\(h_{5}= \mathbb {E}[\log {\mathcal {NIW}(\varvec{\Theta }_k^{nov}\mid \varvec{\rho }_k^{nov})}]\), and, finally,

$$\begin{aligned} \begin{aligned} h_{2} =&\sum _{j=0}^{J}(\eta _j-1)\left( \varvec{\psi }(\eta _j)-\varvec{\psi }(\sum _{j=0}^J\eta _j)\right) + \ln {\Gamma \left( \sum _{j=0}^J{\eta _j}\right) } - \sum _{j=0}^{J}\ln {\Gamma (\eta _j)},\\ h_{3} =&\;(a_k-1)(\varvec{\psi }(a_k)-\varvec{\psi }(a_k+b_k))\\&+(b_k-1)(\varvec{\psi }(b_k)-\varvec{\psi }(a_k+b_k)) - \ln {\left( \frac{\Gamma (a_k)\Gamma (b_k)}{\Gamma (a_k+b_k)}\right) }. \end{aligned} \end{aligned}$$

Additional details are deferred to Section S2 of the Supplementary Material.
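As an illustration, the digamma-based contributions \(f_7\), \(f_8\), \(h_2\), and \(h_3\) can be evaluated with a few vectorized operations. The following R sketch is purely illustrative and uses hypothetical argument names; the Beta-related sums run over the \(T-1\) non-degenerate sticks.

```r
## Sketch of the digamma-based ELBO contributions f7, f8, h2 and h3
## eta and alpha have length J + 1; a and b have length T - 1
elbo_weight_terms <- function(eta, alpha, a, b, gamma_dp) {
  Elog_pi <- digamma(eta) - digamma(sum(eta))
  f7 <- sum((alpha - 1) * Elog_pi)
  f8 <- sum((gamma_dp - 1) * (digamma(b) - digamma(a + b)))
  h2 <- sum((eta - 1) * Elog_pi) + lgamma(sum(eta)) - sum(lgamma(eta))
  h3 <- sum((a - 1) * (digamma(a) - digamma(a + b)) +
            (b - 1) * (digamma(b) - digamma(a + b)) -
            (lgamma(a) + lgamma(b) - lgamma(a + b)))
  c(f7 = f7, f8 = f8, h2 = h2, h3 = h3)
}
```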

At each step of the algorithm, one needs to update the parameters according to the rules (9)–(13) and to evaluate all the terms in \(\{f_k\}_{k=1}^8\) and \(\{h_k\}_{k=1}^5\). Albeit faster than MCMC, this procedure still involves numerous computations. To further reduce the overall computing time, we developed an R package relying on an efficient C++ implementation. The package is openly available at the GitHub repository JacopoGhirri/VarBRAND. In the same repository, the interested reader can find all the R scripts written to run the simulation studies and the real data application discussed in the following sections.

3.3 VBrand clustering estimation

Ultimately, we seek to partition the observations into either known components or novelties. While recovering a single posterior clustering solution is not immediate when using the Gibbs sampler because of label switching (see Denti et al. (2021) for a description of the post-processing algorithm in this context), it is almost straightforward with a variational inference approach. As previously mentioned, the responsibilities are pivotal in this context. Once the convergence of the ELBO is reached, we are left with a matrix \(\Phi ^{\star }=\{\varphi ^{\star ,(m)}_k \}\), with M rows (one for each observation) and \(J+T\) columns (one for each possible cluster, known or novel). Then, for each observation \(m=1,\ldots , M\), the optimal clustering assignment is given by

$$\begin{aligned} c_m^{\star } = \mathop {\textrm{arg max}}\limits _{k=1,\ldots ,J+T} \varphi ^{\star ,(m)}_k. \end{aligned}$$

Of course, \(c_m^{\star } \in \{1,\ldots ,J+T\}\). The set \(\varvec{c}^\star = \left( c_1^{\star },\ldots ,c_M^{\star } \right)\) contains the estimated partition of the data. We underline that, while \(J+T\) clusters are fitted to the data, the number of distinct values in \(\varvec{c}^\star\) will potentially be much lower than \(J+T\). In other words, we must distinguish between the number of fitted mixture components and the number of populated components. In this way, \(T\) can be interpreted as an upper bound on the number of novelty components expected to be observed, while we let the data estimate the number of populated novelty components \(T^\star \le T\).
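In code, this step amounts to a row-wise arg max of the responsibility matrix; a minimal R sketch, with hypothetical inputs, follows.

```r
## Recover the partition from the converged M x (J + T) responsibility matrix
estimate_partition <- function(Phi_star, J) {
  c_star <- max.col(Phi_star, ties.method = "first")   # arg max over the J + T columns
  list(c_star = c_star,
       T_star = length(unique(c_star[c_star > J])))    # populated novelty components
}
```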

4 Simulation studies

In this section, we report the performance of our variational algorithm on a range of simulated datasets. In particular, our simulation study comprises three different experiments. Each experiment investigates a different aspect of the model, while altogether they provide a multi-faceted description of its performance.

The first experiment focuses on the scaling capabilities of our proposal; the second compares the results and efficiency of VBrand with the original slice sampler, while the last one assesses the sensitivity of the recovered partition to the hyperprior specification. For the MCMC algorithm, we follow the default hyperprior setting suggested in Denti et al. (2021). Moreover, we run the slice sampler for 20,000 iterations, discarding the first 10,000 as burn-in. As for VBrand, we use the same hyperprior specifications as for the MCMC (unless otherwise stated), while we use k-means estimation to initialize the means of the novelty terms, as sketched below. We set a threshold \(\varepsilon = 10^{-9}\) on the ELBO improvement as the stopping rule.
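The k-means initialization of the novelty means can be obtained with base R; in the minimal sketch below, Y_test and the truncation level T_trunc are hypothetical objects.

```r
## Initialize the novelty component means via k-means on the test set
km <- kmeans(Y_test, centers = T_trunc, nstart = 10)
mu_init_nov <- km$centers        # one initial centre per potential novelty component
```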

Table 1 Characteristics of the synthetic datasets used in the three simulation studies (SS1−SS3). The components flagged with \(^*\) are novelties. \(^\dagger\) the variances reported in this table refer to the low overlap scenario. For the high overlap scenario, we consider \(\sigma ^{2\star } = 6\sigma ^2\)

4.1 Classification performance

We test the classification performance of our proposal by applying VBrand to a sequence of increasingly complex variations of a synthetic dataset. We monitor the computation time and different clustering metrics to provide a complete picture of the overall performance. In particular, we compute the Adjusted Rand Index (ARI, Hubert and Arabie 1985), the Adjusted Mutual Information (AMI, Vinh et al. 2009), and the Fowlkes-Mallows Index (FMI, Fowlkes and Mallows 1983). While the first two metrics correct for agreement due solely to chance, the latter performs well also when noise is added to an existing partition. Therefore, the joint inspection of these three quantities aims to provide a complete picture of the results.
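These metrics are readily computed in R; the sketch below assumes the mclust and aricode packages for the ARI and AMI, respectively, and evaluates the FMI directly from the contingency table, with truth and pred denoting hypothetical label vectors.

```r
library(mclust)    # adjustedRandIndex()
library(aricode)   # AMI()
ari <- adjustedRandIndex(truth, pred)
ami <- AMI(truth, pred)
## Fowlkes-Mallows index from the contingency table (pair-counting definition)
tab <- table(truth, pred)
fmi <- sum(choose(tab, 2)) /
       sqrt(sum(choose(rowSums(tab), 2)) * sum(choose(colSums(tab), 2)))
```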

The data generating process (DGP) considered for this experiment is based on a mixture of 7 bivariate Normals. In detail, the first three components represent the known classes, appearing in both training and test sets, while the remaining 4 are deemed to be novelties, present only in the test set. The components are characterized by different mean vectors \(\varvec{\mu }_k\), covariance matrices parameterized by \((\sigma _k^2,\rho _k)\), and cardinalities in the training (\(n_{k}\)) and test (\(M_{k}\)) sets. We consider a reference scenario, where \(\sum _{k=1}^3 n_k = n=900\) and \(\sum _{k=1}^7 M_k = M=1000\). The main attributes of this DGP are summarized in the first block, named SS1, of Table 1.

Starting from this basic mechanism, we subsequently increment the difficulty of the classification tasks. Specifically, we modify both the data dimension and the sample size as follows:

  • Sample size: while keeping the mixture proportions unaltered, we consider the sample sizes \(\tilde{n}_k=q\cdot n_k\) and \(\tilde{M}_k=q\cdot M_k\), with multiplicative factor \(q\in \{0.5, 1, 2.5, 5, 10\}\), in both training and test sets;

  • Data dimensionality: we augment the dimensionality p of the problem by considering \(p\in \{2, 3, 5, 7, 10\}\). Each added dimension (above the second) comprises independent realizations from a standard Gaussian (a small sketch of this augmentation follows the list). Note that the resulting datasets define a particularly challenging discrimination task: all the information needed to distinguish the different components is contained in the first two dimensions, while the remaining ones only display overlapping noise.
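A minimal, purely illustrative sketch of this dimensionality augmentation:

```r
## Pad informative two-dimensional data with independent standard Gaussian noise
augment_dims <- function(Y2d, p) {
  if (p <= 2) return(Y2d)
  cbind(Y2d, matrix(rnorm(nrow(Y2d) * (p - 2)), ncol = p - 2))
}
```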

For each combination of sample size and dataset dimension, we generate 50 replications of each simulated dataset and summarize the results by computing the means and the standard errors of the chosen metrics. Results for this experiment are reported in Fig. 2. We immediately notice that the clustering performance deteriorates as the dimensionality of the problem increases. This trend is expected, especially given the induced overlap in the added dimensions. However, the classification abilities of our method remain consistent and satisfactory across all metrics. Indeed, as we can see from the three panels at the top, ARI, AMI, and FMI are all strictly above 70% across all scenarios. This outcome indicates that not only are the known classes correctly identified and clustered as such, but the flexible nonparametric component effectively captures the novelty term. The computation time (bottom panel) grows exponentially as a function of the test set cardinality. Interestingly, the increment of data dimensionality does not significantly impact the computational costs, suggesting an effective scalability of our proposal to high-dimensional problems. Indeed, even when the test size is in the order of tens of thousands, the devised CAVI algorithm always reaches convergence in less than half a minute. We also remark that the time needed for convergence is sensitive to the initialization provided to the algorithm, explaining the high variance that characterizes the computational costs. Finally, it is worth commenting on the seemingly higher computational costs that appear when \(p=2\). The lower panel of Fig. 2 contains the total time needed to reach convergence, which changes across runs, as the number of iterations may vary. The lower dimensionality allows the algorithm to better explore the mean-field distributional space, obtaining more precise solutions at the cost of a higher total convergence time. This feature is reflected in the consistently good performance of our method when \(p=2\). For completeness, in Section S3 of the Supplementary Material, we report a plot summarizing the computational cost per iteration.

Fig. 2

Performance metrics and elapsed time obtained by the VBrand algorithm stratified by number of variables p and size scaling n. The dots represent the averages obtained over 50 replicates of the simulated experiment, while the vertical bars display the associated standard errors

4.2 Comparison with MCMC

We now compare the variational and the MCMC approaches for approximating the Brand posterior. The MCMC algorithm we consider is the modified slice sampler introduced in Denti et al. (2021). We compare the two estimation approaches leveraging the same DGP highlighted in the previous section, suitably modified. In detail, we consider five spherical bivariate Gaussians to generate the classes, out of which three are deemed novelties: the basic details of the resulting DGP can be found in the second block of Table 1 (SS2). We consider 50 different scenarios resulting from the interactions of the levels of the following three attributes:

  • Simple vs. complex scenarios: we set the variance of all the mixture components to either \(\sigma _S^2=0.2\) or \(\sigma _C^2=0.75\) (the only exception being \(\sigma _{C_5^*}^2=0.375\)). The former value implies clear separation among the elements in the simple scenario. In contrast, the latter variance defines the complex case, where we induce some overlap that may hinder the classification. A descriptive plot is displayed in Section S3 of the Supplementary Material;

  • Sample size: we modify the default sample sizes \(n =M= 1000\) by considering different multiplicative factors in \(q\in \{0.5, 1, 2.5,5,10\}\), thus obtaining datasets ranging from 500 to 10000 observations.

  • Data dimensionality: we augment the dimensionality p of the problem by considering \(p\in \{2, 3, 5, 7, 10\}\). The dimensionality augmentation is carried out as described in Sect. 4.1.

We assess the classification performance with the same metrics previously introduced. For each of the 50 simulated scenarios, we perform 50 Monte Carlo replicates to assess the variation in the performance. A summary of the classification results under the complex scenario is reported in Fig. 3. The panels show that the slice sampler always outperforms the VB implementation in the two-dimensional case. However, as the dimensionality of the dataset increases, the MCMC performance rapidly drops, while VBrand always obtains a good clustering recovery, irrespective of the data dimensionality and the sample size. Similar results are obtained under the simple scenario, for which a summarizing plot can be found in Section S3 of the Supplementary Material. Figure 4 compares the algorithms in terms of computation time. As expected, VBrand provides results in just a fraction of the time required by the MCMC approach, being approximately two orders of magnitude faster.

Fig. 3

Performance metrics (MCMC in red, variational Bayes - VB - in blue) stratified by number of variables p and sample size scaling factor q under the complex scenario. The dots represent the averages obtained over 50 replicates of the simulated experiment, while the vertical bars display the associated standard errors

Fig. 4

Computational time in seconds (MCMC in red, variational Bayes - VB - in blue) grouped by number of variables p, sample size scaling factor q, and type of scenario. The dots represent the averages obtained over 50 replicates of the simulated experiment, while the vertical bars display the associated standard errors

These results cast light not only on the clear gain in computational speed obtained with the variational approach, which is to be expected, but also on its superior recovery of the underlying partition of the test set in more complex scenarios.

4.3 Sensitivity analysis

Finally, we investigate the sensitivity of the model classification to different hyperprior specifications. Two pairs of crucial hyperparameters may considerably affect the clustering results. The first pair comprises \(\varvec{\alpha }\) and \(\gamma\), which drive the parametric and nonparametric mixture weights, respectively. The second pair is given by the \(\mathcal {NIW}\) precision \(\lambda ^{nov}_0\) and degrees of freedom \(\nu ^{nov}_0\) of the novelty components. To assess their impact, we devise a sensitivity analysis considering each possible combination of the following hyperparameters: \(\varvec{\alpha } \in \{0.1,0.55,1\}\), \(\gamma \in \{1,5.5,10\}\), \(\lambda ^{nov}_0\in \{1,5,10\}\), and \(\nu ^{nov}_0\in \{4,52,100\}\), thus defining 81 scenarios. We fit VBrand to a dataset composed of five bivariate Normals, considering a fixed sample size for both training and test sets. We chose a small sample size for the training set to limit the informativeness of the robust estimation procedure. Moreover, for each combination of the hyperparameters, we consider both low and high values for the variances of the mixing components, obtaining scenarios with low overlap (LOV) and high overlap (HOV), respectively. Additional details about the data-generating process can be found in the third block of Table 1 (SS3). For this experiment, we compare the retrieved partitions in terms of ARI, as done in Sects. 4.1 and 4.2, and by monitoring the F1 score, i.e., the harmonic mean of precision and recall. The results are reported in Fig. 5. By inspecting the panels, we immediately notice that in the LOV case the method performs well regardless of the combination of hyperparameters chosen for the prior specification. In the HOV scenario, the recovery of the underlying true data partition is less effective, as is nevertheless to be expected. In particular, setting a high value for the degrees of freedom \(\nu ^{nov}_0\) produces a slight drop in the ARI metric. This behavior is due to the extra flexibility allowed to the novelty component, by which some of the units belonging to the known groups are incorrectly captured by the novelty term.

Fig. 5

Classification metrics obtained over 50 replicates for 81 combinations of the hyperparameters \(\varvec{\alpha }\), \(\gamma\), \(\lambda ^{nov}_0\), and \(\nu ^{nov}_0\) under the low overlap (LOV) and high overlap (HOV) cases. The dots represent the averages obtained over 50 replicates of the simulated experiment, while the vertical bars display the associated standard errors

In summary, our proposal increases the scalability of the approach introduced in Denti et al. (2021), allowing the original novelty detector to be successfully applied to complex and high-dimensional scenarios while remaining protected against contaminated samples, since the robust learning phase (the first stage) presented in Denti et al. (2021) is left unaltered.

5 Application to novel soil type detection

For our application, we consider the Statlog (Landsat Satellite) Data Set, publicly available from the UCI Machine Learning Repository. It consists of a collection of observations from satellite images of different soils. Each image contains four spectral measurements over a 3x3 grid, recorded to classify the soil type captured in the picture. There are a total of six different soil types recorded: Red Soil (RS), Grey Soil (GS), Damp Grey Soil (DGS), Very Damp Grey Soil (VDGS), Cotton Crop (CC), and Soil with Vegetation Stubble (SVS). We frame the original classification problem as a novelty detection task by removing the images of CC and SVS from the training set, leaving these groups in the test set to be detected as novelties.

Even when performing a simple classification, a method that can account for the possible presence of previously unseen scenarios can be of paramount utility in many fields. For example, new plants (Christenhusz and Byng 2016), animals (Camilo et al. 2011), and viruses (Woolhouse et al. 2012) are progressively discovered every year. Likewise, related to our application, landscapes present novelties at increasing rates (Finsinger et al. 2017). Moreover, a scalable model that can discern and separate outlying observations is necessary when dealing with real-world data, allowing the results to be robust to outliers or otherwise irregular observations. Once these observations are flagged, they can be the objective of future investigations. Thus, our novelty detection application to the Statlog dataset is a nontrivial example that could encourage the broader utilization of our method.

The original data are already split into training and test sets. After removing the CC and SVS classes from the training set, we obtain a training set of \(n=3486\) observations. The test set instead contains \(M=2000\) instances. Each observation includes the four spectral values recorded over the 9-pixel grid. Therefore, we model these data with a semiparametric mixture of 36-dimensional multivariate Normals. Given the large dimensionality of the dataset, the application of the MCMC estimation approach is problematic in terms of both required memory and computational time. Indeed, estimating the model via the slice sampler becomes unfeasible on most laptop computers. Moreover, we recall that the MCMC approach showed some numerical instabilities in our simulation studies when applied to high-dimensional datasets.

We apply VBrand adopting a mixture with full covariance matrices to capture the potential dependence across the different pixels. Being primarily interested in clustering, we first rescale the data to avoid numerical problems. In detail, we divided all the values in both the training and test sets by 4.5 to reduce the dispersion. Indeed, before the correction, the within-group variances ranged from 25.40 to 228.04; after it, they range from 1.25 to 11.26, significantly improving the stability of the algorithm.

Since variational techniques are likely to find a locally optimal solution, we run the CAVI algorithm 200 times adopting different initializations. For each run, we obtain a different random starting point as follows:

  • we set the centers for the novelty \(\mathcal {NIW}\)s equal to the centers returned by a k-means algorithm performed over the whole test set, with k equal to the chosen truncation level T;

  • the Dirichlet parameters, \(\varvec{\nu }^{nov}\), and \(\varvec{\lambda }^{nov}\) are randomly selected. In particular, we sample the Dirichlet parameters from the interval \(\left( 0.1, 1\right)\) and the \(\mathcal {NIW}\) parameters from \(\left( 1,10\right)\) through Latin Hypercube sampling (a sketch follows this list).
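A simplified sketch of this initialization scheme, assuming the lhs R package and drawing a single value per parameter block, is reported below.

```r
library(lhs)
n_runs <- 200
U <- randomLHS(n_runs, 3)              # Latin hypercube design on the unit cube
alpha_init  <- 0.1 + 0.9 * U[, 1]      # Dirichlet parameters in (0.1, 1)
nu_init     <- 1   + 9   * U[, 2]      # NIW degrees of freedom in (1, 10)
lambda_init <- 1   + 9   * U[, 3]      # NIW precisions in (1, 10)
```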

The other variational hyperparameters are assumed fixed, equal to the corresponding prior hyperparameters. In Section S4 of the Supplementary Material, we report a detailed list of the hyperparameter specifications that we adopted for this analysis.

As a final result, we select the run with the highest ELBO value, being the one whose variational distribution has the lowest KL divergence from the true posterior. The top-left panel of Fig. 6 shows the ELBO trends for all the runs we performed. The bottom panel of Fig. 6 reports the resulting confusion matrix: \(T=10\) potential novelty components are fitted to the data, but only \(T^\star =8\) have been populated. We observe that the algorithm successfully detected both novelties, achieving a satisfactory classification performance for the previously observed soil types (ARI\(=0.590\), FMI\(= 0.664,\) and AMI\(= 0.593\)). However, the model struggles with classifying the DGS instances, often mistaken for GS or VDGS. Such difficulty is explained by the overlap between these groups, as shown by the visualization of the test set obtained via the tSNE projection (Hinton and van der Maaten 2008), reported in the top-right panel of Fig. 6: from the plot, we see that it is not straightforward to establish clear boundaries between the GS, DGS, and VDGS soil types.

Fig. 6

Top-left panel: collection of ELBO trajectories obtained via CAVI updates starting from 200 different random initializations; the trajectory providing the highest ELBO is highlighted in blue (the y axis is truncated for improved visualization). Top-right panel: projection of the test dataset onto a two-dimensional space via the tSNE algorithm. Bottom panel: heatmap of the resulting confusion matrix

Overall, VBrand captures the main traits of the data and flags some observations as outliers (e.g., Novelty clusters 4 and 5), which may warrant further investigation. All in all, our variational approach provides a good clustering solution in a few seconds and it is fast enough to allow for a brute-force search for a better, albeit only locally optimal, solution employing multiple initializations.

6 Discussion and conclusions

Performing novelty detection in high-dimensional and massive datasets presents substantial challenges, arising from the intrinsic difficulty of the task, nowadays widespread, and from the associated computational complexities. To this end, Bayesian inference offers an effective solution, as it provides a well-defined probabilistic framework that allows for the fruitful incorporation of pre-existing knowledge into the modeling pipeline through informative prior specifications. However, the use of simulation-based algorithms, such as the MCMC techniques commonly employed in estimating Bayesian models, may pose severe limitations to the scalability and applicability of prior-informed novelty detectors when huge datasets are to be processed.

Motivated by this issue, in this paper we introduced VBrand, a variational Bayes algorithm for novelty detection, to classify instances of a test set that may conceal classes not observed in a training set. We showed how VBrand outperforms the previously proposed slice sampler implementation in terms of both computational time and robustness of the estimates. The application to soil data provides an example of the versatility of our method in a context where the MCMC algorithm fails because of the large dimensionality of the problem.

Our results pave the way for many possible extensions. First, the variational algorithm can be enriched by adding a hyperprior distribution for the concentration parameter of the novelty DP (Escobar and West 1995). While in practice VBrand already obtains very good classification performance, this addition would lead to a consistent model for the number of true clusters. Second, we can consider different likelihood specifications and develop variational inference novelty detectors for, but not limited to, functional or graph-valued data. Third, at the expense of efficiency, we can explore more complex specifications for the variational distributions, as in structured variational inference. Albeit potentially slower, this choice would lead to an algorithm that better captures the complex structure of the posterior distribution we are targeting. Finally, we can resort to stochastic variational inference algorithms (Hoffman et al. 2013) to scale up VBrand’s applicability to massive datasets that could benefit from novelty detection techniques.