Variational Inference for Semiparametric Bayesian Novelty Detection in Large Datasets

After being trained on a fully-labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging on a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand applicability to large datasets, we propose to resort to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly-available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.


Introduction
Statistical methods for novelty detection are becoming increasingly popular in recent literature.Similar to standard supervised classifiers, these models are trained on a fully labeled dataset and subsequently employed to group unlabeled data.In addition, novelty detectors allow for the discovery and consequent classification of samples showcasing patterns not previously observed in the learning phase.For example, novelty detection models have been successfully employed in uncovering unknown adulterants in food authenticity studies [Cappozzo et al., 2020, Fop et al., 2022], collective anomalies in high energy particle physics [Vatanen et al., 2012], and novel communities in social network analyses [Bouveyron, 2014].Formally, we define a novelty detector as a classifier trained on a set (the training set) characterized by a specific number of classes that is used to predict the labels of a second set (the test set).For a detailed account of the topic, the interested reader is referred to the reviews of Markou and Singh [2003b] and Markou and Singh [2003a], where the former is devoted explicitly to statistical methods for novelty detection.
Within the statistical approach, different methodologies have been proposed to construct novelty detectors.Recently, Denti et al. [2021] introduced a Bayesian Robust Adaptive Novelty Detector (Brand).Brand is a semiparametric classifier divided into two stages, focused on training and test datasets, respectively.In the first stage, a robust estimator is applied to each class of the labeled training set.By doing so, Brand recovers the representative traits of each known class.The second step consists in fitting a semiparametric nested mixture to the test set: the hierarchical structure of the model specifies a convex combination between terms already observed in the training set and a novelty term, where the latter is decomposed into a potentially unbounded number of novel classes.At this point, Brand uses the information extracted from the first phase to elicit reliable priors for the known components.
In their article, Denti et al. [2021] devised an MCMC algorithm for posterior estimation.The algorithm is based on a variation of the slice sampler [Kalli et al., 2011] for nonparametric mixtures, which avoids the ex-ante specification of the number of previously unseen components, reflecting the expected ignorance about the structure of the novelty term.As a result, the algorithm obtains good performance as it targets the true posterior without resorting to any truncation.Nonetheless, as is often the case when full MCMC inference is performed, it becomes excessively slow when applied to large multidimensional datasets.In this paper, we aim to scale up the applicability of Brand by adopting a variational inference approach, vastly improving its computational efficiency.Variational inference is an estimating technique that approximates a complex probability distribution by resorting to optimization [Jordan et al., 1999, Ormerod andWand, 2010], which has received significant attention within the statistical literature in the past decade [Blei et al., 2017].In the Bayesian set-up, the distribution of interest is, of course, the posterior distribution.Variational inference applied to Bayesian problems (also known as variational Bayes, VB from now on) induces a paradigm shift in the approximation of a complicated posterior: we switch from a simulation problem (MCMC) to an optimization one.Broadly speaking, the VB approach starts with the specification of a class of simple distributions called the variational family.Then, within this specified family, one looks for the member that minimizes a suitable distributional divergence (e.g., the Kullbak-Lielber divergence) from the actual posterior.By dealing with a minimization problem instead of a simulation one, we can considerably scale up the applicability of Brand, obtaining results for datasets with thousands of observations measured in high-dimension in a fraction of the time needed by MCMC techniques.Variational techniques have showcased the potential to enhance the relevance of the Bayesian framework even to large datasets, sparking interest in its theoretical properties [e.g., Wang and Titterington, 2012, Nieman et al., 2022, Ray and Szabó, 2022], and its applicability to, for instance, logistic regression [Jaakkola andJordan, 2000, Rigon, 2019], network analysis [Aliverti and Russo, 2022], and, more in general, non-conjugate [Wang, 2012] and advanced models [Zhang et al., 2019].
The present paper is structured as follows.In Section 2, we review Brand and provide a summary of the variational Bayes approach.In Section 3, we discuss the algorithm and the hyperparameters needed for the VB version of the considered model.Section 4 reports extensive simulation studies that examine the efficiency and robustness of our method.Then, in Section 5, we present an application to the Statlog dataset, openly available from the UCI dataset repository.In this context, novelty detection is used to discover novel types of soil starting from satellite images.We remark that this analysis would have been prohibitive with a classical MCMC approach.Lastly, Section 6 concludes the manuscript.
2 Background: Bayesian novelty detection and variational Bayes This section introduces the two stages in which Brand articulates and briefly presents the core concepts of the variational Bayes approach.

The Brand model
In what follows, we review the two-stage procedure proposed in Denti et al. [2021].The first stage centers around a fully labeled learning set from which we extract robust information to set up a Bayesian semiparametric model in the second stage.More specifically, consider a labeled training dataset with n observations grouped into J classes.This paper will focus on classes distributed as multivariate Gaussians, but one can readily extend the model using different distributions.To this regard, we will write N (• | Θ) to indicate a Multivariate Gaussian density with generic location and scale parameters Θ = (µ, Σ).Within each of the J training classes, we separately apply the Minimum Regularized Covariance Determinant (MRCD) estimator [Boudt et al., 2020] to retrieve robust estimates for the associated mean vector and covariance matrix.In detail, the MRCD searches for a subset of observations with fixed size whose regularized sample covariance has the lowest possible determinant.MRCD estimates are then defined as the multivariate location and regularized scatter based on the so-identified subset.Compared to the popular Minimum Covariance Determinant [Rousseeuw andDriessen, 1999, Hubert et al., 2018], MRCD takes advantage of regularization with a prespecified positive definite target matrix to ensure the existence of the solution even when the data dimension exceeds the sample size.This feature implies that the robustness properties of the original procedure are preserved whilst being applicable also to "p > n" problems.Such a characteristic is paramount in the context considered in the present paper, where we specifically aim to scale up the applicability of Brand to high-dimensional scenarios.We will denote the estimates obtained for class j, j = 1, . . ., J, with ΘMRCD j = μMRCD j , ΣMRCD j .These quantities will be employed to elicit informative priors for the J known components in the second stage.Specifically, in so doing, outliers and label noise that might be present in the training set will not hamper the Bayesian mixture model specification hereafter reported.
In the second stage, we estimate a semiparametric Bayesian classifier on a test set with M unlabeled observations.We want to build a novelty detector, i.e., a model that can discern between "known" units -which follow a pattern already observed in the training set -or "novelties".At this point, the likelihood for each observation is a simple two-group mixture between a generic novelty component f nov and a density reflecting the behavior found in the training set f obs .Given stage 1, it is immediate to specify f obs as a mixture of J multivariate Gaussians, whose hyperpriors elicitation will be guided by each of the robust estimates ΘMRCD j .Therefore, we can now write the likelihood for the generic observation y m , where m = 1, . . ., M , as a mixture of J + 1 distributions: (1) Estimating this model allows us to either allocate each of the M observations into one of the previously observed J Gaussian classes or flag it as a novelty generated from an unknown distribution f nov .The specification of f nov is more delicate.To specify a flexible distribution that would reflect our ignorance regarding the novelty term, we employ a Dirichlet process mixture model with Gaussian kernels [Escobar and West, 1995].It is a mixture model where the mixing distribution is sampled from a Dirichlet process [Ferguson, 1973], characterized by concentration parameter γ and base measure H.In other words, we have that , where the atoms are sampled from the base measure, i.e., Θ nov ∼ H, and the mixture weights follow the so-called stick-breaking construction [Sethuraman, 1994], where w k = v k l<k (1−v l ) and v l ∼ Beta(1, γ).To indicate this process, we will write ω ∼ SB(γ).Thus, the likelihood in (1) becomes, for m = 1, . . ., M , This nested-mixture expression of the likelihood highlights a two-fold advantage.First, it is highly flexible, effectively capturing departures from the known patterns and flagging them as novelties.Second, the mixture nature of f nov allows to automatically cluster the novelties, capturing potential patterns that may arise.Furthermore, clusters in the novelty terms characterized by very small sizes can be interpreted as simple outliers.
Notice that the previous nested-mixture likelihood can be conveniently re-expressed as: where, for k ∈ N, we define πk = π k 1{0<k≤J} (π 0 ω k−J ) 1{k≥J} and Θk = (Θ obs k ) 1{0<k≤J} (Θ nov k ) 1{k≥J} .Note that, without loss of generality, we regard the first J mixture components as the known components and all the remaining ones as novelties.
Finally, as customary in mixture models, to ease the computation, we augment the model by introducing the auxiliary variables ξ m ∈ N, for m = 1, . . ., M , where ξ m = l means that the m-th observation has been assigned to the l-th component.Therefore, the model in (2) simplifies to We complete our Bayesian model with the following prior specifications for the weights and the atoms: where N IW indicates a Normal Inverse-Wishart distribution.To ease the notation, let . We remark that the values for the hyperparameters { obs k } J k=1 are defined according to the robust estimates obtained in the first stage.See Denti et al. [2021] for more details about the hyperparameters specification.
Given the model, it is easy to derive the joint law of the data and the parameters, which is proportional to the posterior distribution of interest p (Θ, π, ξ | y).Therefore, the posterior we target for inference is proportional to: The left panel of Figure 1 contains a diagram that summarizes how Brand works.In the following, we will devise a VB approach to approximate (5) in a timely and efficient manner.The following subsection briefly outlines the general strategy underlying a mean-field variational approach, while the thorough derivation of the algorithm used to estimate Brand is deferred to Section 3.

A short summary of mean-field variational Bayes
Working in a Bayesian setting, we ultimately seek to estimate the posterior distribution to draw inference.Unfortunately, the expression in ( 5) is not available in closed form, and therefore we need to rely on approximations.MCMC algorithms, which simulate draws from (5), could be prohibitively slow when applied to large datasets.To this extent, we resort to variational inference to recasts the approximation of ( 5) into an optimization problem.For notational simplicity, while we present the basic ideas of VB, let us denote a generic target distribution with p(θ | y) ∝ p(θ, y).
As in Blei et al. [2017], we focus on a mean-field variational family Q, a set containing distributions for the parameters of interest θ that are all mutually independent: The postulated independence dramatically simplifies the problem at hand.Notice that each member of Q depends on a set of variational parameters denoted by ζ.
We seek, among the members of this family, the candidate that provides the best approximation of our posterior distribution p(θ | y).Herein, the goodness of the approximation is quantified by the Kullback-Leibler (KL) divergence.Thus, we aim to find the member of the variational family Q that minimizes the KL divergence between the actual posterior distribution and the variational approximation.The KL divergence D KL (• || •) can be written as . Unfortunately, we cannot easily compute the evidence p(y).However, the evidence does not depend on any variational parameter and can be treated as fixed w.r.t.θ during the optimization process.We then reformulate the problem into an equivalent one, the maximization of the Evidence Lower Bound (ELBO), which is fully computable: Maximizing ( 6) is equivalent to minimize the aforementioned KL divergence.
To detect the optimal member q ∈ Q, we employ a widely used algorithm called Coordinate Ascent Variational Inference (CAVI).It consists of a one-variable-at-a-time optimization procedure.Indeed, exploiting the independence postulated by the mean-field property, one can show that the optimal variational distribution for the parameter θ j is given by: where θ −j = {θ l } l =j and the expected value is taken w.r.t. the densities of θ −j .The CAVI algorithm iteratively computes (7) for every j until the ELBO does not register any significant improvement.The basic idea behind the CAVI algorithm is depicted in the right-half of Figure 1.
In the next subsections, we will derive the CAVI updates for the variational approximation of our model, along with the corresponding expression of the ELBO.We call the resulting algorithm the variational Brand, or VBrand.

Variational Bayes for the Brand model
In this section, we tailor the generic variational inference algorithm to our specific case.First, we write our variational approximation, highlighting the dependence of each distribution on its specific variational parameters, collected in ζ = η, ϕ, a, b, ρ obs , ρ nov .The factorized form we adopt is: In Equation ( 8), we truncated the stick-breaking representation of the Dirichlet Process on the novelty term at a pre-specified threshold T , as suggested in Blei and Jordan [2006].This implies that q(v T = 1) = 1 and that all the variational mixture weights indexed by t > T are equal to zero.Then, we can exploit a key property of VB.Note that all the full conditionals of the Brand model have closed-form expressions (c.f.r.Section S1 of the Supplementary Material) and belong to the exponential family.This feature greatly simplifies the search for the variational solution.Indeed, it ensures that the corresponding optimal variational distributions belong to the same family of the corresponding full conditional, with properly updated variational parameters.Therefore, we can already state that q η (π) is the density function of a Dirichlet distribution, q ϕ (m) (ξ (m) ) follows a Categorical distribution, each q a k ,b k (v k ) follows a Beta distributions, and q ρ obs k (Θ obs k ) and q ρ nov k (Θ nov k ) are both distributed according to a Normal Inverse-Wishart.

The CAVI parameters update
Once the parametric expressions for the members of the variational family are obtained, we can derive the explicit formulas to optimize the parameters via the CAVI algorithm.In this subsection, we state the updating rules that have to be iteratively computed to fit VBrand.As we will observe, the set of responsibilities {ϕ (m) k } J+T k=1 , i.e., the variational probabilities ϕ (m) k = q(ξ (m) = k), will play a central role in all the steps.In detail, the CAVI steps are as follows: 1. q η (π) is the density of a Dirichlet(η 0 , η 1 , . . ., η J ), where Here, each hyperparameter linked to a known component is updated with the sum of the responsibilities of the data belonging to the same specific known component.Likewise, the variational novelty probability hyperparameter η 0 contains the sum of the responsibilities of all data belonging to all the novelty terms.

Each q
The update for these variational parameters is given by: Here, as expected from the stick-breaking construction, the first parameter of the variational Beta distribution is updated with the sum of the probabilities of each point belonging to a specific novelty cluster k.At the same time, the second parameter is updated with the sum of variational probabilities of belonging to one of the next novelty components.
3. Both q ρ nov k (Θ nov k ) and q ρ obs k (Θ obs k ) are Nomal Inverse-Wishart densities.Let us start with the updating rules of the known components.Note that each variational parameter ρ obs k is a shorthand for m obs k , obs k , u obs k , S obs k .These parameters have the same interpretation as the parameters of (4), contained in k .So we have where we defined Σ(m k .The update for the parameters in ρ nov k follows the same structure, with the hyperprior parameters in nov carefully substituted to obs .

Updating the responsibilities {ϕ (m)
k } J+T k=1 for m = 1, . . ., M, is the most challenging step of the algorithm, given the nested nature of the mixture in (2).We recall that, for a given m, the distribution q ϕ (m) (ξ (m) ) is categorical with J + T levels.Thus, we need to compute the values for the J + T corresponding probabilities.For the known classes k = 1, . . ., J, we have log ϕ while for the novelty terms k = J + 1, . . ., J + T , we have log ϕ For the sake of conciseness, we report the explicit expression for all the terms of ( 12) and ( 13) in Section S2 of the Supplementary Material.This means that the probability of datum y m to belong to cluster k depends on the likelihood of y m under that same cluster and on the overall relevance of k th cluster.Such relevance is determined as the expected value of the relative component of the Dirichlet distributed π and, for novelties, the stick-breaking weight, here unrolled in its Betadistributed components.

The expression of the ELBO for VBrand
In this subsection, we report the terms that need to be derived to obtain the ELBO in Equation ( 6) for VBrand model.We start by computing the first term of Equation ( 6), which takes the following form: where the quantities {f k } 8 k=1 have the following expressions (note we have suppressed the superscripts to ease the notation, and that ψ(•) indicates the digamma function): Moreover, we have and, lastly, The second term of Equation ( 6) can be written as and, finally, Additional details are deferred to Section S2 of the Supplementary Material.
At each step of the algorithm, one needs to update the parameters according to the rules (9)-( 13) and to evaluate all the terms in {f k } 8 k=1 and {h k } 5 k=1 .Albeit faster than MCMC, this articulates over numerous steps.To further reduce the overall computing time, an R package relying on an efficient C++ implementation has been implemented.The package is openly available at the GitHub repository JacopoGhirri/VarBRAND.In the same repository, the interested reader can find all the R scripts written to run the simulation studies and real data application that we will discuss in the following sections.

Simulation studies
In this section, we report the performance of our variational algorithm on a range of simulated datasets.In particular, our simulation study is articulated into three different experiments.
The first experiment focuses on the scaling capabilities of our proposal; the second compares the results and efficiency of VBrand with the original slice sampler, while the last one assesses the sensitivity of the recovered partition to the hyperprior specification.For the MCMC algorithm, we follow the default hyperprior setting suggested in Denti et al. [2021].Moreover, we run the slice sampler for 20,000 iterations, discarding the first 10,000 as burn-in.As for VBrand, we use the same hyperprior specifications used for the MCMC (unless otherwise stated), while we use k-means estimation to initialize the novelty terms' means.We set a threshold ε = 10 −9 as stopping rule.

Classification performance
We test the classification performance of our proposal by applying VBrand to a sequence of increasingly complex variations of a synthetic dataset.We monitor the computation time and different clustering metrics to provide a complete picture of the overall performance.In particular, we compute the Adjusted Rand Index (ARI, Hubert and Arabie [1985]), the Adjusted Mutual Information (AMI, Vinh et al. [2009]), and the Fowlkes-Mallows Index (FMI, Fowlkes and Mallows [1983]).While the first two metrics correct the effect of agreement solely due to chance, the latter performs well also if noise is added to an existing partition.Thereupon, the joint inspection of these three quantities aim to provide a complete picture of the results.
The considered data generating process (DGP) for this experiment is based on a mixture of 7 bivariate Normals.In detail, the first three components represent the known classes, appearing in both training and test sets, while the remaining 4 are deemed to be novelties, present only in the test set.The components are characterized by different mean vectors µ k , correlation matrices (σ 2 k , ρ k ), and cardinalities in the training (n T r ) and test (n T e ).The main attributes of this DGP are summarised in the first block, named SS1, of Table 1.
Starting from this basic mechanism, we subsequently increment the difficulty of the classification tasks.Specifically, we increase both the data dimension and the sample size as follows: • Sample size: while keeping unaltered the mixture proportions, we consider the sample sizes ñk = q • n k , with multiplicative factor q ∈ {0.5, 1, 2.5, 5, 10} in both the training and test sets; • Data dimensionality: we augment the dimensionality p of the problem by considering p ∈ {2, 3, 5, 7, 10}.Each added dimension (above the second) comprises independent realization from a standard Gaussian.Note that the resulting datasets define a particularly challenging discrimination task: all the information needed to distinguish the different components is contained in the first two dimensions.In contrast, the remaining ones only display overlapping noise.
For each combination of sample size and dataset dimension, we generate 50 replications of each simulated dataset and summarize the results by computing the means and the standard errors of the chosen metrics.Results for this experiment are reported in Figure 2. We immediately notice that the clustering performances deteriorate as the dimensionality of the problem increases.This trend is expected, especially given the induced overlap in the added dimensions.However, the classification abilities of our method remain consistent and satisfactory across all metrics.Indeed, as we can see from the three panels at the top, ARI, AMI, and FMI are all strictly above 70% across all scenarios.This outcome indicates that not only are the known classes correctly identified and clustered as such, but the flexible nonparametric component effectively captures the novelty term.The computation time (bottom panel) grows exponentially as a function of the test set cardinality.Interestingly, the increment of data dimensionality does not significantly impact the computational costs, suggesting an effective scalability of our proposal to high dimensional problems.Indeed, even when the test size is in the order of tens of thousands, the devised CAVI algorithm always reaches convergence in less than half a minute.Lastly, we remark that the time needed for convergence is sensitive to the different initialization provided to the algorithm, explaining the high variance that characterized the computational costs.

Comparison with MCMC
We now compare the variational and the MCMC approaches for approximating the Brand posterior.The MCMC algorithm we consider is the modified slice-sampler introduced in Denti et al. [2021].We compare the estimating approaches leveraging on the same DGP highlighted in the previous section and slightly modified accordingly.In detail, we consider five spherical bivariate Gaussian to generate the classes, out of which three are deemed as novelties: the basic details of the resulting DGP can be found in the second block of Table 1 (SS2).We consider 50 different scenarios resulting from the interactions of the levels of following three attributes: • Simple vs. complex scenarios: we set the variance of all the mixture components to either σ 2 S = 0.2 or σ 2 C = 0.75 (the only exception being σ 2 C * 5 = 0.375).The former value implies clear separation among the elements in the simple scenario.In contrast, the latter variance defines the complex case, where we induce some overlap that may hinder the classification.A descriptive plot is displayed in Section S3 of the Supplementary Material; • Sample size: we modify the default sample size n = 1000 by considering different multiplicative factors in q ∈ {0.5, 1, 2.5, 5, 10}, thus obtaining datasets ranging from 500 to 10000 observations.• Data dimensionality: we augment the dimensionality p of the problem by considering p ∈ {2, 3, 5, 7, 10}.The dimensionality augmentation is carried out as described in Section 4.1.
We assess the classification performances with the same metrics previously introduced.For each of the 50 simulated scenarios, we perform 50 Monte Carlo replicates to assess the variation in the performances.A summary of the classification results under the complex scenario are reported in Figure 3.The panels show that the slice sampler always outperforms the VB implementation in the two-dimensional case.However, as the dimensionality of the dataset increases the MCMC performance rapidly drops, while VBrand always obtains good clustering recovery, irrespective of the data dimensionality and the sample size.Similar results are obtained under the simple scenario, for which a summarizing plot can be found in Section S3 of the Supplementary Material.Figure 4  time.As expected, VBrand provides results in just a fraction of the time required by the MCMC approach, being approximately two orders of magnitude faster.These results cast light not only on the apparent gain in computational speed when using the variational approach, which is expected by any means, but also on the superior recovering of the underlying partition in the test set in more complex scenarios.q = 0.5 q = 1 q = 2.5 q = 5 q = 10 Simple scenario

Sensitivity Analysis
Finally, we investigate the sensitivity of the model classification to different hyperprior specifications.Two pairs of crucial hyperparameters may considerably affect the clustering results.The first two are α and γ, which drive the parametric and nonparametric mixture weights, respectively.The second pair is given by the N IW precision λ nov 0 and degrees of freedom ν nov 0 of the novelty components.To assess their impact, we devise a sensitivity analysis considering each possible combination of the following hyperparameters: α ∈ {0.1, 0.55, 1}, γ ∈ {1, 5.5, 10}, λ nov 0 ∈ {1, 5, 10}, and ν nov 0 ∈ {4, 52, 100}, thus defining 81 scenarios.We fit VBrand to a dataset composed of five Normals in a bivariate domain, considering a fixed sample size for both training and test sets.We chose a small sample size for the training set to limit the informativeness of the robust estimation procedure.Moreover, for each combination of the hyperparameters, we consider both low and high values for the variances of the mixing components, obtaining scenarios with low overlap (LOV) and high overlap (HOV), respectively.Additional details about the data-generating process can be found in the third block of Table 1 ( SS3).For this experiment, we compare the retrieved partitions in terms of ARI, as done in Sections 4.1 and 4.2, and by monitoring the F 1 score, i.e., the harmonic mean of precision and recall.The results are reported in Figure 5.We immediately notice by inspecting the panels that for the LOV case the method performs well regardless of the combination of hyper-parameters chosen for the prior specification.In the HOV scenario the recovery of the underlying true data partition is less effective, as it is nevertheless expected.
In particular, it seems that setting a high value for the degrees of freedom ν nov 0 produces a slight drop in the ARI metric.This behavior is due to the extra flexibility allowed to the novelty component, by which some of the units belonging to the known groups are incorrectly captured by the novelty term.
[2021], allowing for the original novelty detector to be successfully applied to complex scenarios in a more timely manner, without being it affected by the hyper-parameters specification.

Application to novel soil type detection
For our application, we consider the Statlog (Landsat Satellite) Data Set, publicly obtained from the UCI machine-learning repository 1 .It consists of a collection of observations from satellite images of different soils.Each image contains four spectral measurements over a 3x3 grid, recorded to classify the soil type captured in the picture.There are a total of six different soil types recorded: Red Soil (RS), Grey Soil (GS), Damp Grey Soil (DGS), Very Damp Grey Soil (VDGS), Cotton Crop (CC), and Soil with Vegetation Stubble (SVS).We frame the original classification problem in a novelty detection task by removing the images of CC and SVS from the training set, leaving these groups in the test set to be detected as novelties.
Even when performing a simple classification, a method that can account for the possible presence of previously unseen scenarios can be of paramount utility in many fields.For example, new plants [Christenhusz and Byng, 2016], animals [Camilo et al., 2011], and viruses [Woolhouse et al., 2012] are progressively discovered every year.Likewise, related to our application, landscapes present novelties at increasing rates [Finsinger et al., 2017].Moreover, a scalable model that can discern and separate outlying observations is necessary when dealing with real-world data, allowing the results to be robust to outliers or otherwise irregular observations.Once these observations are flagged, they can be the objective of future investigations.Thus, our novelty detection application to the Statlog dataset is a nontrivial example that may inspire future applications of our method.
The original data are already split into training and test sets.After removing the CC and SVS classes from the training set, we obtain a training set of 3486 observations.The test set instead contains 2000 instances.Each observation includes the four spectral values recorded over the 9-pixel grids.Therefore, we will model these data with a semiparametric mixture of 36-dimensional multivariate Normals.Given the large dimensionality of the dataset, the application of the MCMC estimation approach is problematic in terms of both required memory and computational time.Indeeed, estimating the model via slice sampler becomes unfeasible for most laptop computers.Moreover, we recall that the MCMC approach showed some numerical instabilities in our simulation studies when applied to large dimensional datasets.
We apply VBrand adopting a mixture with full covariance matrices to capture the potential dependence across the different pixels.Being primarily interested in clustering, we first rescale the data to avoid numerical problems.In detail, we divided all the values in both training and test by 4.5 to reduce the dispersion.Indeed, before the correction, the variabilities within groups ranged from 25.40 to 228.04.After, the within-group variability ranges from 1.25 to 11.26, significantly improving the stability of the algorithm.
Since the variational techniques are likely to find a locally optimal solution, we run the CAVI algorithm 200 times adopting different initializations.For each run, we obtain different random starting point as follows: • we set the centers for the novelty NIWs equal to the centers returned by a k-means algorithm performed over the whole test set, with k being equal to the chosen truncation; • the Dirichlet parameters, ν nov and λ nov are randomly selected.In particular, we sample the Dirichlet parameters from (0.1, 1) and the N IW parameters from (1, 10) through Latin Hypercube sampling; The other variational hyperparameters are assumed fixed, equal to the corresponding prior hyperparameters.In Section S4 of the Supplementary Material, we report a detailed list of the hyperparameter specifications that we adopted for this analysis.As a final result, we select the run with the highest ELBO value, being the one whose variational distribution has the lowest KL divergence from the true posterior.The top-left panel of Figure 6 shows the ELBO trends for all the run we performed.The bottom panel of Figure 6 reports the resulting confusion matrix: we observe that the algorithm successfully detected both novelties, achieving a satisfactory classification performance of the previously observed soil types.However, the model struggles with classifying the DGS instances, often mistaken for GS or VDGS.Such difficulty is explained by the overlap between these groups, as shown by the visualization of the test set obtained via the tSNE projection [Hinton and van der Maaten, 2008], reported in the top-right panel of Figure 6: from the plot, we see that it is not straightforward to establish clear boundaries between GS, DGS, and VDGS soil types.Overall, VBrand captures the main traits of the data and flags some observations as outliers (e.g., Novelty clusters 4 and 5), which may warrant further investigation.All in all, our variational approach provides a good clustering solution in a few seconds and it is fast enough to allow for a brute-force search for a better, albeit only locally optimal, solution employing multiple initializations.

Discussion and conclusions
In this paper we introduced VBrand, a variational Bayes algorithm for novelty detection, to classify instances of a test set that may conceal classes not observed in a training set.We showed how VBrand outperforms the previously proposed slice sampler implementation in terms of both computational time and robustness of the estimates.The application to soil data provides an example of the versatility of our method in a context where the MCMC algorithm fails because of the large dimensionality of the problem.Our results pave the way for many possible extensions.First, given recent developments in Bayesian nonparametric literature, the variational algorithm can be enriched by adding a hyperprior distribution for the concentration parameter of the novelty DP [Ascolani et al., 2022].While in practice VBrand already obtains very good classification performance, this addition would lead to a consistent model for the number of true clusters.Second, we can consider different likelihood specifications and develop variational inference novelty detectors for, but not limited to, functional or graphical data.Third, at the expense of efficiency, we can explore more complex specifications for the variational distributions, as in structured variational inference.Albeit potentially slower, this choice would lead to an algorithm that could better capture the complex structure of the posterior distribution we are targeting.Finally, we can resort to stochastic variational inference algorithms [Hoffman et al., 2003], to scale up VBrand's applicability to massive datasets that could benefit from novelty detection techniques.
• E[log v k−J ] = ψ(a k−J ) − ψ(a k−J + b k−J ) and E[log (1 − v l )] = ψ(b l ) − ψ(a l + b l ), • If Σ −1 ∼ Wishart, we have that: Given the previous results and the updating rules derived in Section S1, we can easily derive the value for the negative entropy of a multivariate Normal distribution for q ∈ {obs, nov}:

S4 Hyperparameter initializations for the novel soil detection analysis
We report a detailed list of the hyperparameters initialization of the VBrand model used for the novelty detection of soil type in Section 5 of the main paper.
1.The truncation is set at T = 10.
2. The concentration parameter of the stick-breaking process is set to γ = 5.
3. The hyperparameters of the Dirichlet distribution are all set to α j = 0.1 ∀j, inducing the same probability of being in each known component and of being a novelty.
4. The N IW prior for the novelty locations is parameterized as follows: µ nov 0 is set equal to the grand mean of the training set, to allow an hyperprior specification that does not rely on the test set; the precision is given as λ nov 0 = 0.1,the degrees of freedom ν nov 0 = p + 2, where p is the dimensionality of the dataset.This is the minimum integer value that is allowed to have so that the IW distribution has a finite mean.Finally, the variance is set to be (p + 1) times the training sample covariance matrix.This choice implies that the expected variance sampled from the N IW is an inflated version of the overall sample covariance matrix of the training set. 5.For the known components, the mean vector, and the variance-covariance matrices are estimated via the MRCD procedure.To put high mass on these estimates, for all the known groups, we set λ obs k = 200 and ν obs k = p + 1 + 200.

Figure 2 :
Figure 2: Performance metrics and elapsed time obtained by the VBrand algorithm stratified by number of variables p and size scaling n.The dots represent the averages obtained over 50 replicates of the simulated experiment, while the vertical bars display the associated standard errors.

Figure 3 :
Figure3: Performance metrics (MCMC in red, variational Bayes -VB -in blue) stratified by number of variables p and sample size scaling factor q under the complex scenario.The dots represent the averages obtained over 50 replicates of the simulated experiment, while the vertical bars display the associated standard errors.

Figure 4 :
Figure 4: Computational time in seconds (MCMC in red, variational Bayes -VB -in blue) grouped by number of variables p, sample size scaling factor q, and type of scenario.The dots represent the averages obtained over 50 replicates of the simulated experiment, while the vertical bars display the associated standard errors.

Figure 5 :
Figure 5: Classification metrics obtained over 50 replicates for 81 combinations of the hyperparameters α, γ, λ nov 0 , and ν nov 0 under the low overlap (LOV) and high overlap (HOV) cases.The dots represent the averages obtained over 50 replicates of the simulated experiment, while the vertical bars display the associated standard errors.

Figure 6 :
Figure 6: Top-left panel: collection of ELBO trajectories obtained via CAVI updates starting from 200 different random intializations; the trajectory providing the highest ELBO is highlighted in blue (the y axis is truncated for improved visualization).Top-right panel: projection of the test dataset onto a two-dimensional space via the tSNE algorithm.Bottom panel: heatmap of the resulting confusion matrix.

S2. 2 Figure 7 :
Figure 7: Example of the first two dimensions of the generated data under the simulation study described in Section 4.2 of the main paper.

Table 1 :
Characteristics of the synthetic datasets used in the three simulation studies (SS1−SS3).The components flagged with * are novelties.† the variances reported in this table refer to the low overlap scenario.For the high overlap scenario, we consider