A two-stage Bayesian semiparametric model for novelty detection with robust prior information

Novelty detection methods aim at partitioning the test units into already observed and previously unseen patterns. However, two significant issues arise: there may be considerable interest in identifying specific structures within the novelty, and contamination in the known classes could completely blur the actual separation between manifest and new groups. Motivated by these problems, we propose a two-stage Bayesian semiparametric novelty detector, building upon prior information robustly extracted from a set of complete learning units. We devise a general-purpose multivariate methodology that we also extend to handle functional data objects. We provide insights on the model behavior by investigating the theoretical properties of the associated semiparametric prior. From the computational point of view, we propose a suitable ξ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\xi }$$\end{document}-sequence to construct an independent slice-efficient sampler that takes into account the difference between manifest and novelty components. We showcase our model performance through an extensive simulation study and applications on both multivariate and functional datasets, in which diverse and distinctive unknown patterns are discovered.


Introduction
Supervised classification techniques aim at predicting a qualitative output for a test set by learning a classifier on a fully-labeled training set. To this extent, classical methods assume that the labeled units are realizations from each and every sub-groups in the target population. However, many real datasets contradict these assumptions. As an example, one may think about an evolving ecosystem in which novel species are likely to appear over time. In other words, basic classifiers cannot handle the presence of previously unobserved -or hidden -classes in the test set. Novelty detection methods, also known as adaptive methods, address this issue by modeling the presence of classes in the test set that have not been previously observed in the training. Relevant examples of this type of data analysis include, but are not limited to, radar target detection (Carpenter et al., 1997), detection of masses in mammograms (Tarassenko et al., 1995), handwritten digit recognition (Tax and Duin, 1998) and e-commerce (Manikopoulos and Papavassiliou, 2002), for which labeled observations may not be available for every group.
Within the model-based family of classifiers, adaptive methods recently appeared in the literature. Miller and Browning, 2003 pioneer a mixture model methodology for class discovery. Bouveyron, 2014 introduces an adaptive classifier in which two algorithms, based respectively on transductive and inductive learning, are devised for inference. More recently, Fop et al., 2021 extend the original work of Bouveyron, 2014 by accounting for unobserved classes and extra variables in high-dimensional discriminant analysis.
Classical model-based classifiers are not robust, as they lack the capability of handling outlying observations in the training and in the test set. On the one hand, the presence of outliers in the training set can significantly alter the learning phase, resulting in poorly representative classes and therefore jeopardizing the entire classification process. In the training set, we identify as outliers units with implausible labels and/or values. Cappozzo et al., 2020 extend the work of Bouveyron, 2014 addressing this problem by using a robust estimator that relies on impartial trimming (Gordaliza, 1991). In short, the most unlikely data points under the currently estimated model are discarded. On the other hand, dealing with outliers in the test set is a more delicate task. Ideally, we would like to distinguish between novelties, i.e., test observations displaying a common, specific pattern, and anomalies, i.e., test observations that can be regarded as noise. While the distinction between novel and anomalous entities is most often apparent in practice, there exist some circumstances under which such separation is vague and somewhat philosophical. Let us go back to the aforementioned evolving ecosystem example. It may happen that at an early instant, a real novelty is mistaken to be mere noise due to its embryonic stage. Contrarily, if we fitted the same model at a later time point, the increased sample size could be sufficient to acknowledge an actual novel species.
To address the discussed challenges, we propose a two-stage Bayesian semiparametric novelty detector. We devise our model to sequentially handle the outliers in the training set and the latent classes in the test set. In the first stage, we learn the main characteristics of the known classes (for example, their mean and variance) from the labeled dataset using robust procedures. In the second phase, we fit a Bayesian semiparametric mixture classes and outliers. To do so, we devise a two-stage strategy. The first phase, described in Section 2.1, relies on a class-wise robust procedure for extracting prior information from the training set. Then, we fit a Bayesian semiparametric mixture model to the test units. A full account of its definition is reported in Section 2.2. A diagram summarizing our modeling proposal is reported in Figure 1.

Stage I: Robust extraction of prior information
The first step of our procedure is designed to obtain reliable estimatesΘ j for the parameters of the observed class j, j = 1 . . . , J, from the learning set. To this aim, one could employ standard methods as the maximum likelihood estimator, or different posterior estimates under the Bayesian framework. Nonetheless, these standard approaches are not robust against contamination, and the presence of only a few outlying points could entirely bias the subsequent Bayesian model, should the informative priors be improperly set. We report a direct consequence of this undesirable behavior in the simulation study of Section 6.1. Therefore, we opt for more sophisticated alternatives to learn the structure of the known classes, employing methods that can deal with outliers and label noise. Particularly, the selected methodologies involve the Minimum Covariance Determinant (MCD) estimator (Rousseeuw, 1984;Hubert et al., 2018) and, when facing high-dimensional data (as in the functional case of Section 6.3), the Minimum Regularized Covariance Determinant (MRCD) estimator (Boudt et al., 2020). Clearly, at this stage, one can use any robust estimators of multivariate scatter and location for solving this problem: see, for instance, the comparison study reported in Maronna and Yohai (2017) for a non-exhaustive list of suitable candidates.
We decide to rely on the MCD and MRCD for their well-established efficacy in the classification framework (Hubert and Van Driessen, 2004) and direct availability of fast algorithms for inference, readily implemented in the rrcov R package (Todorov and Filzmoser, 2009). We briefly recall the main MCD and MRCD features in the remaining part of this section. For a thorough treatment the interested reader is referred to Debruyne, 2010 andBoudt et al., 2020, respectively. The MCD is an affine equivariant and highly robust estimator of multivariate location and scatter, for which a fast algorithm is available (Rousseeuw and Driessen, 1999). The raw MCD estimator with parameter η M CD ∈ [0.5, 1] such that (n + p + 1)/2 ≤ η M CD N ≤ N defines the following location and dispersion estimates: •μ M CD is the mean of the η M CD N observations for which the determinant of the sample covariance matrix is minimal •Σ M CD is the corresponding covariance matrix, multiplied by a consistency factor c 0 (Croux and Haesbroeck, 1999) with · denoting the floor function. The MCD is a consistent, asymptotically normal and highly robust estimator with bounded influence function and breakdown value equal to (1 − η M CD N /N )% (Butler et al., 1993;Cator and Lopuhaä, 2012). However, a major drawback is its inapplicability when the data dimension p exceeds the subset size η M CD N as the covariance matrix of any η M CD N -subset becomes singular. This situation appears ever so often in our context, as the MCD is group-wise applied to the observed classes in the training set, such that it is sufficient to have p > min n j ,j=1,...,J η M CD n j for the MCD solution to be ill-defined. To overcome this issue, Boudt et al., 2020 introduced the MRCD estimator. The main idea is to replace the subset-based covariance estimation with a regularized one, defined as a weighted average of the sample covariance on the η M CD N -subset and a predetermined positive definite target matrix. The MRCD estimator is defined as the multivariate location and regularized scatter based on the η M CD N -subset that makes its overall determinant the smallest. The MRCD preserves the good breakdown properties of its non-regularized counterpart, and besides, it is applicable in high-dimensional problems where η M CD N is possibly smaller than p.
The first phase of our two-stage modeling thus works as follows: considering the available labels l n , n = 1, . . . , N we apply the MCD (or MRCD) estimator within each class to extractμ M CD j andΣ M CD j , j = 1 . . . , J. For ease of notation, we use superscript 'MCD' for the robust estimates even when we consider its regularized version. In general, if the sample size is large enough, the MCD solution is preferred. There is no reason for η M CD to be the same in all observed classes. If a group is known a priori to be particularly outliers-sensitive, one should set its associated MCD subset size to a smaller value than the remaining ones. However, since this type of information is seldom available, we subsequently let η M CD j = η M CD for all classes in the learning set. This concludes the first stage: the retained estimates are then incorporated in the Bayesian model for the second stage, presented in Section 2.2. The robust knowledge extracted from X is treated as a source of reliable prior information, eliciting informative hyperparameters. In this way, outliers and label noise that might be present in the labeled units will not bias the initial beliefs for the known groups in the second stage, which is the main methodological contribution of the present manuscript. Figure 1: A diagram that summarizes Brand two-stage structure. In Stage I, robust information extraction (via the MCD estimator) is performed and subsequently used to elicit the priors for the mixture model in Stage II. In the second stage, a finite mixture model (FMM) is fitted to the data, distinguishing among known components and novelties. The novelty term is modeled with a Dirichlet Process mixture model (DPMM).
resort to the Dirichlet Process mixture model (DPMM) of Gaussians densities (Lo, 1984;Escobar and West, 1995) imposing the following structure: where DP (γ, H) denotes a Dirichlet Process with concentration parameter γ and base measure H (Ferguson, 1973). Adopting Sethuraman's Stick Breaking construction (Sethuraman, 1994), we can express the likelihood as follows: The term ∞ h=1 ω h φ (·|Θ nov h ) represents a Dirichlet Process realization convoluted with a Normal kernel, for flexibly modeling a potentially infinite number of hidden classes and/or outlying observations. The following prior probabilities for the parameters com-6 plete the Bayesian model specification: Values a 1 , . . . , a J are the hyper-parameters of a Dirichlet distribution on the known classes. We can exploit the learning set to determine reasonable values of such hyperparameters by setting a j = n j /N . The quantity a 0 determines the initial prior belief on how much novelty we are expecting to discover in the test set. Generally, the parameter controlling the novelty proportion a 0 is a priori considered to be small. We adopt a conjugate Normal-inverse-Wishart (NIW) prior for both the location-scale parameters of the manifest and the novel classes. For each of the known groups, we assume that are the MCD robust estimates obtained in Stage I. At the same time, the precision parameter λ T r and the degrees of freedom ν T r are treated as tuning parameters to enforce high mass around the robust estimates. By letting these two parameters go off to infinity, we can also recover the degenerate case P T r j = δΘ j where the Dirac's delta denotes a point mass centered inΘ j . That is, the prior beliefs extracted from the training set can be flexibly updated by gradually transitioning from transductive to inductive inference by increasing λ T r and ν T r (Bouveyron, 2014). Similarly, we set H ≡ N IW (m 0 , λ 0 , ν 0 , S 0 ) , where the hyperparameters are chosen to induce a flat prior for the novel components. Lastly, with ω ∼ SB (γ) we denote the vector of Stick-Breaking weights, composed of elements defined as It is well known that, under the DP specification, the expected number of clusters induced in the novelty term grows as γ log M . We choose the DP mostly for computational convenience: if more flexibility is required, Brand can easily be adapted to accommodate different nonparametric priors, such as the Pitman-Yor process (Pitman, 1995;Pitman and Yor, 1997) or the geometric process and its extensions (De Blasi et al., 2020). To facilitate posterior inference given the specification in (4), we consider the following complete likelihood: where α m ∈ {0, . . . , J} and β m ∈ {0, . . . , ∞} are latent variables identifying the unobserved group membership for y m , m = 1, . . . , M . In details, the former identifies whether observation m is a novelty (α m = 0) or not (α m > 0), whereas the latter defines, within the novelty subset, the resulting data partition (β m > 0). To complete the specification, we set ω 0 = 1. Lastly, we want to underline that there might be some cases where the number of novelty groups is known to be bounded and does not grow with the sample size as in the DP case.
In those situations, an appealing alternative to the DPMM is the Sparse Mixture Model, studied by Rousseau and Mengersen (2011) and recently investigated in Malsiner-Walli et al. (2016).

Distinguishing novelties from anomalies
The advantage of employing a DPMM for the novelty part is twofold: on the one hand, all the data coming from unseen components are modeled with a unique, flexible density. On the other hand, the clustering naturally induced by the DPMM favors the separation of the novelty component into actual unseen classes and outlying units. More specifically, since the concept of an outlier does not possess a rigorous mathematical definition (Ritter, 2014), the estimated sample sizes of the discovered classes act as an appropriate feature for discriminating between scattered outlying units and actual hidden groups. That is, if a component φ (·|Θ nov h ) fits only a small number of data points, we can regard those units as outliers. Similarly, we assume to have discovered an extra class whenever it possesses a substantial structure. In real applications, domain-expert supervision will always be crucial for class interpretation when extra groups are believed to have been detected. While the mixture between known and novel distributions is identifiable and not subjected to the label switching problem, the same cannot be said about the DP component modeling the novelty density. To recover a meaningful estimate for the partition of points regarded as novel (β m > 0) we first compute the pairwise coclustering matrix P = {p m,m }, whose entry p m,m denotes the probability that y m and y m belong to the same cluster. We then retrieve the best partition minimizing the Variation of Information (VI) criterion, as suggested in Wade and Ghahramani (2018). More details on how to post-process the MCMC output are given in Section 5.

Properties of the proposed semiparametric prior
We now investigate the properties of the underlying random mixing measure induced by the model specification we presented in the previous section. All the proofs are deferred to the Supplementary Material. We start by noticing that model in (3)-(4) can be generalized in the following hierarchical form, which highlights the dependence on a 8 discrete random measurep: From (7) we can see how our model is an extension of the contaminated informative priors proposed in Scarpa and Dunson (2009), where the authors propose to juxtapose a single atom to a DP. To simplify the exposition of the results in this section, without loss of generality, we assume that both Θ j and Θ nov h are univariate random variables. Consequently, we suppose that each P T r j is a probability distribution with mean µ j , second moment µ j,2 and variance σ 2 The overall variance can also be written in terms of variances of every observed mixture components: The previous expressions are important to compute the covariance between the two random elements Θ m and Θ m , which helps to understand the behavior ofp. Consider a vector = { j } J j=0 , with the first entry equal to 1 1+γ and the remaining entries equal to 1. Then, a j a l µ j µ l + + a 0 (a 0 + 1) a(a + 1) The covariance is composed of three terms. In the first two, the seen and unseen components have the same influence. The last term is non-negative and entirely determined by quantities linked to the novel part of the model. Notice that if γ → 0 the covariance becomes a j a l µ j µ l which is the same covariance we would obtain ifp =p 0 ≡ J j=0 π j δ Θ j , i.e., if we were dealing with a "standard" mixture model with J + 1 components. This implies that (8) can be rewritten as which leads to a nice interpretation. The introduction of novelty atoms decreases the "standard" covariance. This effect gets stronger as the prior weight given to the novelty component a 0 , the dispersion of the base measure σ 2 0 and/or the concentration parameter γ increases.
Given the discrete nature ofp, we can expect ties between realizations sampled from this measure, say Θ m and Θ m . Therefore, we can compute the probability of obtaining a tie as: where the contribution to this probability of the novelty terms is multiplicatively reduced by a factor that depends on the inverse of the concentration parameter. If a priori we expect a large number of clusters in the novelty term (large γ), the probability of a tie reduces. Indeed, some noticeable limiting cases arise: a j (a j + 1) a(a + 1) .
If γ → 0 we obtain a finite mixture of J + 1 components. Conversely, γ → +∞ leads to the case of a DP with numerous atoms characterized by similar probability, hence annihilating the contribution to the probability of the novelty term. Moreover, suppose we rewrite the distribution of π as Dir a 0 J+1 ,ã J+1 , . . . ,ã J+1 . In this case, the hyperparameters relative to the observed groups are assumed equal toã. Then, we obtain a = a 0 +Jã J+1 , and As J increases, the second part of (10) vanishes. Accordingly, if we suppose an unbounded number of observed groups letting J → ∞, then we have as in the classical DP case, and the model loses its ability to detect novel instances.

Functional Novelty Detection
The modeling framework introduced in Section 2 is very general and can be easily modified to handle more complex data structures. In this section, we develop a methodology for functional classification that allows novelty functional detection, building upon model (3)-(4). We hereafter assume that our training and test instances are error-prone realizations of a univariate stochastic process X (t), t ∈ T with T ⊂ R.
Recently, numerous authors have contributed to the area of Bayesian nonparametric functional clustering (see, for example Bigelow and Dunson, 2009;Petrone et al., 2009;Rodriguez and Dunson, 2014;Rigon, 2019). Canale et al. (2017) propose a Pitman-Yor mixture with a spike-and-slab base measure to effectively model the daily basal body temperature in women by including the a priori known distinctive biphasic trajectory that characterizes healthy beings. Instead of modifying the base measure of the nonparametric process, Scarpa and Dunson (2009) address the same problem by contaminating a point mass with a realization from a DP. As such, part of our method can be seen as a direct extension of the latter, where J ≥ 1 different atoms centered in locations learned from the training set are contaminated with a DP.
Let Θ m (t) = f m (t), σ 2 m (t) denote the vector comprising the smooth functional mean f m : T → R and the measurement noise σ 2 m : T → R + for a generic curve m in the test set, evaluated at the instant t. Then the Brand model, introduced in Section 2.2 for multivariate data, can be modified as follows: (π 0 , π 1 , . . . , π J ) ∼ Dir(a 0 , a 1 , . . . , a J ), ω ∼ SB(γ), where all the distributions P T r j and the base measure H model the functional mean and the noise independently. We propose the following informative prior for Θ j = f j (t), σ 2 j (t) : We denote the estimates obtained from the training set of the mean and variance functions, asf j andσ 2 j , respectively, for each observed class j, with j = 1, . . . , J. The hyper-parameters ϕ j define the degree of confidence we a priori assume for the information extracted from the learning set, while the Inverse Gamma (IG) specification ensures that E σ 2 j (t) =σ 2 j (t) and V ar σ 2 j (t) = v j . It remains to define how we computef j andσ 2 j , that is, how the robust extraction of prior information is performed in this functional extension. Applying standard procedures in Functional Data Analysis (Ramsay, James, Silverman, 2005), we first smooth each training curve x n (t) via a weighted sum of B basis functions is the b-th basis evaluated in t and ρ nb its associated coefficient. Given the acyclic nature of the functional objects treated in Section 6.3, we will subsequently employ B-spline bases (Boor, 2001). Clearly, depending on the problem at hand, other basis functions may be considered. After such representation has been performed, we are left with J matrices of coefficients each of dimension n j × B. By treating them as multivariate entities, as done for example in Abraham et al., 2003, we resort to the very same procedures described in Section 2.1, and we set is the robust location estimate on the n j × B matrix of coefficients, and M CD denotes the subset of untrimmed units resulting from the MCD/MRCD procedure in group j, j = 1, . . . , J. On the other hand, more flexibility is needed to specify the base . Therefore, via the same smoothing procedure considered for the training curves, we build a hierarchical specification for the quantities involved in the novelty term: The first line of (13) can be rewritten as We call this model functional Brand: it provides a powerful extension for functional novelty detection. A successful application is reported in Section 6.3.

Posterior Inference
The posterior distribution p(π, ω, a, β, Θ, Θ nov |y) is analytically intractable, therefore we rely upon MCMC techniques to carry out posterior inference. An easy sampling scheme can be constructed mimicking the blocked Gibbs sampler of Ishwaran and James (2001), where the infinite series in (3) is truncated at a pre-specified level L < ∞. However, this approach leads to a non-negligible truncation error if L is too small, and to computational inefficiencies if L is set too high. Instead, we propose a modification of the ξ-sequence of the Independent Slice-efficient sampler (Kalli et al., 2011), another well known conditional algorithm to sample from the exact posterior. To adapt the algorithm to our framework, we start from the following alternative reparameterization of the model in (3)- (4): whereΘ is obtained by concatenating Θ and Θ nov , δ k is the usual Dirac delta function, the weights π and ω are defined as in Equation (7), and ζ m is a membership label which maps each observation to its corresponding atomΘ ζm .
Trivially, there is a one-to-one correspondence between the membership vectors (α m , β m ) of model (6) and ζ m However, we prefer the form of model (6) thanks to the direct interpretation of the membership latent variables α and β, which associate each observation to the known or novel classes, respectively. We introduce two sequences of additional auxiliary parameters: a stochastic sequence u = {u m } M m=1 of uniform random variables and a deterministic sequence ξ = {ξ l } l≥1 . The introduction of these two latent variables allows for a stochastic truncation at each iteration of the sampler. The stochastic threshold, called L, is given as L = max L m and L m is the largest integer such that ξ Lm > u m . This threshold establishes a finite number of mixture components needed at each MCMC iteration, making computations feasible. Then, we can rewrite model (6) as In the definition of a dedicated deterministic sequence ξ, it is crucial to take into account the difference between the manifest and the novel components. Usually, a very common choice is ξ l = (1 − κ)κ l−1 , with κ ∈ (0, 1). This option allows to compute each L m analytically, being the smallest integer such that However, the default choice of a geometrically decreasing ξ-sequence is inappropriate in this context, since we are dealing with a mixture where not all the components are 13 ... … Figure 2: Example of deterministic sequence defined according to (17), with κ = 0.25. The blue rectangle highlights the weights relative to the known components.
conceptually equivalent. The default ξ-sequence tends to favor components that come first in the mixture specification (in our case, the known ones). To overcome this issue, we propose the following intuitive modification. Given a value for κ ∈ (0, 1), we equally divide the (1 − κ)% of the mass into the first J + 1 elements of the sequence. We then induce a geometric decay in the remaining ones to split the residual fraction κ. We force the element in position J + 1 to have the same mass given to the manifest components, to avoid an under-representation of the novelty part. To do so, we define It is easy to prove that +∞ l=1 ξ l = 1. According to (17), the first J + 1 elements of the sequence have masses equal to (1 − κ)/(J + 1). The truncation threshold L * changes accordingly, becoming the largest integer such that Inequality (18) states that the truncation threshold L * can be only greater or equal to J + 1, ensuring that the MCMC always takes into consideration the creation of at least one cluster in the novel distribution. A representation of the modified ξ-sequence is depicted in Figure 2. We report the pseudo-code for the devised Gibbs sampler in the Appendix. The algorithm for the functional extension is not included for conciseness. However, its structure closely follows the one outlined for the multivariate case.
Once the MCMC sample is collected, we first compute the a posteriori probability of being a novelty for every test unit m, P P N m = P [y m ∼ f nov |Y], that is estimated according to the ergodic mean:P where α (i) m is the value assumed by the parameter α m at the i-th iteration of the MCMC chain and I is the total number of iterations. We remark that the inference on α can be conducted directly, since the mixture between the J observed components and f nov is not subjected to label switching. In contrast, we need to take this problem into account when dealing with β. To perform valid inference, one possibility is to rely on the posterior probability coclustering matrix (PPCM) as indicated in Section 2.3. Each entry of this matrix p m,m = P [y m and y m belong to the same novelty class] is estimated aŝ Once we obtain the PPCM, we employ it to estimate the best partition (BP) in the novelty subset. Indeed, one can recover the BP by minimizing a loss function defined over the space of partitions, which can be computed starting from the PPCM. A famous loss function was proposed by Binder (1978), and investigated in a BNP setting by Lau and Green (2007). However, the so-called Binder loss presents peculiar asymmetries, preferring to split clusters over merging. These asymmetries could result in a number of estimated clusters higher than needed. Therefore, we adopt the Variation of Information (VI - Meilǎ, 2007) as loss criterion. The associated loss function, recently proposed by Wade and Ghahramani (2018), is known to provide less fragmented results. Finally, once the BP for the novelty component has been estimated, we can rely on a heuristic based on the cluster sizes to discriminate anomalies from actual new classes. Let us suppose that the BP consists of S novel clusters. Denote the number of instances assigned to cluster s ∈ {1, . . . , S} with m nov s . A cluster s is considered to be an agglomerate of outlying points if its cardinality m nov s is sufficiently small in comparison to the entire novelty sample size, otherwise it is regarded as a proper novel group.

Simulation Study
In this section, we present a simulation study aimed at highlighting the capabilities of the new semiparametric Bayesian model in performing novelty detection and we compare it with existing methodologies. We consider different scenarios varying the sample sizes of the hidden classes and the adulteration proportions in the training set. At the same time, we evaluate the importance of the robust information extraction phase and how it affects the learning procedure.

Experimental setup
We consider a training set formed by J = 3 observed classes, each distributed according to a bivariate Normal density N 2 (µ j , Σ j ), j = 1, 2, 3, with the following parameters: The class sample sizes are, respectively, equal to n 1 = 300, n 2 = 300 and n 3 = 400. The same groups are also present in the test set, together with four previously unobserved classes. We generate the new classes via bivariate Normal densities with parameters: The test set encompasses a total of 7 components: 3 observed and 4 novelties. Starting from the above-described data generating process, we consider four different scenarios varying: • Data contamination level -No contamination in the training set (Label noise = False) -12% label noise between classes 2 and 3 (Label noise = True) • Test set sample size -Novelty subset size equal to slightly more than 30% of the test set (Novelty size = Not small) Figure 3 exemplifies the experiment structure displaying a realization from the Label noise = True, Novelty size = Not small scenario. As it is evident from the plots, the label noise is strategically included to cause a more difficult identification of the fourth class, should the parameters of the second and third classes be non-robustly learned. Further, notice that the last group presents limited sample size and variability: it could easily be regarded as pointwise contamination (i.e., an anomaly) rather than an actual new component. Nonetheless, following the reasoning outlined in the introduction, we are interested in evaluating the ability of the nonparametric density to capture and discriminate these types of peculiar patterns as well. For each combination of contamination level and test set sample size, we simulate B = 100 datasets. Results are reported in the following subsection.

Simulation results
We compare the performance of the Brand model with different hyper-parameters specifications: • the information from the training set is either non-robustly (η M CD = 1) or robustly (η M CD = 0.75) extracted, • the precision parameter associated with the training prior belief is either very high (λ T r = 1, 000) or moderately low (λ T r = 10).
In addition, two model-based adaptive classifiers are considered in the comparison, namely the inductive RAEDDA model (Cappozzo et al., 2020) with labeled and unlabeled trimming levels respectively equal to 0.12 and 0.05, and the inductive AMDA model (Bouveyron, 2014). For each replication of the simulated experiment, a set of four metrics is recorded from the test set: • Novelty predictive value (Precision): the proportion of units marked as novelties by a given method truly belonging to classes 4, . . . , 7, • Accuracy on the observed classes subset: the classification accuracy of a given method within the subset of groups already observed in the training set,   (19), for B = 100 repetitions of the simulated experiment, varying data contamination level and Brand hyper-parameters, Not small novelty subset size. The brighter the color the higher the probability of belonging to f nov .
• Adjusted Rand Index (ARI, Rand, 1971): measuring the similarity between the partition returned by a given method and the underlying true structure, • PPN : a posteriori probability of being a novelty, computed according to Equation We run 40, 000 MCMC iterations and discard the first 20, 000 as a burn-in phase. Apart from the hyper-parameters for the training components, fairly uninformative priors are employed in the base measure H, with m 0 = (0, 0) , λ 0 = 0.01, ν 0 = 10 and S 0 = 10I 2 . Lastly, a Gamma DP concentration parameter is considered with prior rate and scale hyper-parameters both equal to 1. Figure 4 reports the results for B = 100 repetitions of the experiment under the different simulated scenarios. A Table containing the values on which this plot is built is deferred to the Supplementary Material. The Novelty predictive value metric highlights the capability of the model to correctly recover and identify the previously unseen patterns. As expected, in the adulteration-free scenarios, all methodologies succeed well enough in separating known and hidden components. The worst performance is exhibited by the RAEDDA model for which, due to the fixed trimming level, a small part of the group-wise most extreme (but still genuine) observations is discarded, thus slightly overestimating the novelties percentage (the same happens for the ARI metric). Different results are displayed in scenarios wherein the label noise complicates the learning process. Robust procedures efficiently cope with the adulteration present in the training set, while the AMDA and the Brand methods when η M CD = 1 tend to largely overestimate the novelty component. Particularly, the harmful effect caused by the mislabeled units is exacerbated in the Brand model that sets high confidence in the priors (λ T r = 1, 000), while a partial mitigation, albeit feeble, emerges when λ T r is set equal to 10. This consequence is even more apparent in the hex plots of Figure 5, where we see that the latter model tries to modify its prior belief to accommodate the (outlier-free) test units, while the former, forced to stick close to its prior distribution by the high value of λ T r , incorporates the second and third class in the novelty term. The final output, as displayed in the Accuracy on the observed classes subset boxplots, has an overall high misclassification error when it comes to identifying the test units belonging to the previously observed classes. Differently, setting robust informative priors prevents this undesirable behavior, as it is shown by both the high level of accuracy and the associated low posterior probability of being a novelty in the feature space wherein the observed groups lie. On the other hand, the true partition recovery, assessed by the Adjusted Rand Index, does not seem to be influenced by the label noise, with our proposal always outperforming the competing methodologies regardless of which hyperparameters are selected. As previously mentioned, for Brand(η M CD = 1, λ T r = 10) and Brand(η M CD = 1, λ T r = 1, 000) cases the second and third classes are assimilated into the nonparametric component in the Label Noise = True scenario. This is due to the fact that the mislabeled units prevent Brand from correctly learning the true structures of groups two and three in Stage I. As a consequence, no correspondence between these improperly estimated classes in the training is found in the test set, so much so that the DP prior creates them anew within the novelty term. Clearly, this is a sub-optimal behavior as the separation of what is known from what is novel is completely lost, yet it may raise suspicion on dealing with a contaminated learning set, suggesting the need of a robust prior information extraction.
Additional simulated experiments, involving a high-dimensional scenario and novelty detection problem under model misspecification are included in the Supplementary Material.

X-ray images of wheat kernels
Sophisticated and advanced techniques like X-rays, scanning microscopy and laser technology are increasingly employed for the automatic collection and processing of images. Within the domain of computer vision studies, novelty detection is generally portrayed as a one-class classification problem. There, the aim is to separate the known patterns from the absent, poorly sampled or not well defined remainder (Khan and Madden, 2014). Thus, there is strong interest in developing methodologies that not only distinguish the already observed quantities from the new entities, but that also identify specific structures within the novelty component. The present case study involves the detection of a novel grain type by means of seven geometric parameters, recorded postprocessing X-ray photograms of kernels (Charytanowicz et al., 2010). In more detail, for the 210 samples belonging to the three different wheat varieties, high quality visualization of the internal kernel structure is detected using a soft X-ray technique and, subsequently, the image files are post-processed via a dedicated computer software package (Strumi l lo et al., 1999). The obtained dataset is publicly available in the University of California, Irvine Machine Learning data repository. This experiment involves the random selection of 70 training units from the first two cultivars, and a test set of 105 samples, including 35 grains from the third variety. The resulting learning scenario is displayed in Figure  6. The aim of the analysis is to employ Brand to detect the third unobserved variety, whilst performing classification of the known grain types with high accuracy. Firstly, the MCD estimator with hyper-parameter h M CD = 0.95 is adopted for robustly learning the training structure of the two observed wheat varieties. In the second stage, our model is fitted to the test set, discarding 20, 000 iterations for the burn-in phase, and subsequently retaining 10, 000 MCMC samples. As usual, fairly uninformative priors are employed in the base measure H, with m 0 = 0, λ 0 = 0.01, ν 0 = 10 and S 0 = I 7 , where 0 denotes the 7-dimensional zero vector. For the training components, mean and covariance matrices of the Normal-inverse-Wishart priors are directly determined by the MCD output of the first stage, while ν T r and λ T r are specified to be respectively equal to 250 and 1, 000. The latter value indicates that after having robustly extracted information for the two observed classes, high trust is placed in the prior distributions of the known components. Model results are reported in Figure 7, where the posterior probability of being a novelty P P N m = P [y m ∼ f nov |Y], m = 1, . . . , M , displayed in the plots below the main diagonal, are estimated according to the ergodic mean in (19). The a posteriori classification, computed via majority vote, is depicted in the plots above the main diagonal, where the water-green solid diamonds denote observations belonging to the novel class. The confusion matrix associated with the estimated group assignments is reported in Table 1, where the third group variety is effectively captured by the flexible process modeling the novel component. All in all, the promising results obtained with this multivariate dataset may foster the employment of our methodology in automatic image classification procedures that supersede the one-class classification paradigm, allowing for a much more flexible anomaly and novelty detector in computer vision applications.  : Test set for the considered experimental scenario, seeds dataset. Plots below the main diagonal represent the estimated posterior probability of being a novelty. The brighter the color the higher the probability of belonging to f nov . Plots above the main diagonal display the associated group assignments, where the water-green solid diamonds denote observations classified as novelties.

Functional novelty detection of meat variety
In recent years, machine learning methodologies have experienced an ever-growing interest in countless fields, including food authentication research (Singh and Domijan, 2019). An authenticity study aims to characterize unknown food samples, correctly identifying their type and/or provenance. Clearly, no observation is to be trusted in a context wherein the final purpose is to detect potentially adulterated units, in which, for example, an entire subsample may belong to a previously unseen pattern. Motivated by a dataset of Near Infrared Spectra (NIR) of meat varieties, we employ the functional model introduced in Section 4 to perform classification and novelty detection when having a hidden class and four manually adulterated units in the test set. The considered data report the electromagnetic spectrum for a total of 231 homogenized meat samples, recorded from 400 − 2498 nm at intervals of 2 nm (McElhinney et al., 1999). The units belong to five different meat types, with 32 beef, 55 chicken, 34 lamb, 55 pork, and 55 turkey records. The amount of light absorbed at a given wavelength is recorded for each meat sample: A = log 10 (1/R) where R is the reflectance value. The visible part of the electromagnetic spectrum (400 − 780 nm) accounts for color differences in the meat types, while their chemical composition is recorded further along the spectrum. NIR data can be interpreted as a discrete evaluation of a continuous function in a bounded domain. Therefore, the procedure described in Section 4 is a sensible methodological tool for modeling this type of data objects (Barati et al., 2013). We randomly partition the recorded units into labeled and unlabeled sets. The former includes 28 chicken, 17 lamb, 28 pork, and 28 turkey samples. The latter contains the same proportion of these four meat types with an additional 32 beef units. The last class is not observed in the test set and needs to be discovered. Also, four validation units are manually adulterated and added to the test set as follows: • a shifted version of a pork sample, achieved by removing the first 15 data points and appending the last 15 group-mean absorbance values at the end of the spectrum; • a noisy version of a pork sample, generated by adding Gaussian white noise to the original spectrum; • a modified version of a turkey sample, obtained by abnormally increasing the absorbance value in a single specific wavelength to simulate a spike; • a pork sample with an added slope, produced by multiplying the original spectrum by a positive constant.
These modifications mimic the ones considered in the "Chimiométrie 2005" chemometric contest, where participants were tasked to perform discrimination and outlier detection of mid-infrared spectra of four different starches types (Fernández Pierna and Dardenne, 2007). In our context, both the beef subpopulation and the adulterated units are previously unseen patterns that shall be captured by the novelty component. Firstly, we extract robust prior information from the learning set. Given the spectra non-cyclical nature, we approximate each training unit via a linear combination of  (19), the brighter the color the higher the probability of belonging to f nov . B = 100 B-spline bases, and their associated coefficients are retrieved. Given the highdimensional nature of the smoothing process, the MRCD is employed to obtain robust group-wise estimates for the splines coefficients. These quantities, which are linearly combined with the B-spline bases, account for the training atoms Θ j , j = 1, . . . , 4 specified in Equation (11). We adopt a value of η M CD = 0.75 in the first stage, providing functional atoms robust against contamination that may arise in the training set. In this experiment, an inductive approach is considered, for which the training estimates will be kept fixed throughout the subsequent Bayesian learning phase. We further set a τ = 3, b τ = 1, s 2 = 1, a H = 5, and b H = 1, inducing low variability on the noise parameters as much as not to compromise the hierarchical structure between known and novelty components. A more detailed discussion on the hyperparameters choice is deferred to the Supplementary Material, where we evaluate alternative effects for different prior settings within a controlled experiment. OnceΘ j , j = 1, . . . , 4 are retained, the Bayesian model of Section 4 is applied to the test units running a total of 20,000 iterations and discarding the first 10,000 as warm-up. Figure 8 summarizes the results of the fitted model. Each spectrum is colored according to its a posteriori probability of being a novelty, computed as in (19). The resulting confusion matrix is reported in Table 2, where it is apparent that the previously unseen class, as well as the adulterated units (labeled as "Outliers" in the table), are successfully captured by the novelty component. The obtained classification accuracy is in agreement with the ones produced by state-ofthe-art classifiers in a fully-supervised scenario (see, for example, Murphy et al., 2010;Gutiérrez et al., 2014). That is, our proposal is capable of detecting previously unseen classes and outlying units, whilst maintaining competitive predictive power. Looking at the classification performance, we observe that Brand can correctly recover the underlying data partition, except for the turkey subgroup. Specifically, the model struggles to separate the turkey units from the chicken ones.   Focusing on the novelty component, the model entirely captures the beef hidden class and the adulterated units, yet two turkey samples are also incorrectly assigned. The obtained classification for the curves identified to be novelties, resulting by VI minimization, is displayed in the left panel of Figure 10, where two distinct clusters are detected. Interestingly, Brand separates the 32 beef samples (blue dashed lines) from the two turkeys (solid red lines) and classifies three of the four manually adulterated units to the outlying cluster. In contrast, the remaining one is assigned to the beef class, because of its peculiar shape, as it is shown in Figure 13 of the Supplementary Material.
Finally, we investigate why two turkey units are incorrectly assigned to the novel component. A closer look at the turkey sub-population, displayed in the right panel of Figure 10, shows how these two samples exhibit a somehow extreme pattern within their group and can, therefore, be legitimately flagged as outlying or anomalous turkeys. We report two additional figures in the Supplementary Material. The first provides a DP clustering of novelties visual summary of the estimated grouping; the second shows how the turkey test units are partitioned into different clusters.
In this section, we have shown the effectiveness of our methodology in correctly identifying a hidden group in a functional setting, while jointly achieving good classification accuracy and detection of outlying curves. The successful application of the model seems particularly desirable in fields like food authenticity, where generally there is no a priori available information on how many modifications and/or adulteration mechanisms may be present in the samples.

Conclusion and discussion
We have introduced a two-stage methodology for robust Bayesian semiparametric novelty detection. In the first stage, we robustly extract the observed group structure from the training set. In the second stage, we incorporate such prior knowledge in a contaminated mixture, wherein we have employed a nonparametric component to describe the novelty term. The latter could either correspond to anomalies or actual new groups. This distinction is made possible by retrieving the best partition within the novel subset. We have investigated the properties of the random measure underlying the model and its connections with existing methods. Subsequently, the general multivariate methodology has been extended to handle functional data objects, resulting in a novelty detector for functional data analysis. A dedicated slice-efficient sampler, taking into account the difference between unseen and seen components, has been devised for posterior inference. An extensive simulation study and applications on multivariate and functional data have validated the effectiveness of our proposal, fostering its employment in diverse areas from image analysis to chemometrics. Brand can represent the starting point for many different research avenues. Future research directions aim at providing a Bayesian interpretation of the robust MCD estimator to propose a unified, fully Bayesian model. More versatile specifications can be adopted for the known components, weakening the Gaussianity assumption. These extensions can be obtained by adopting more flexible distributions while keeping the mean and variance of the resulting densities constrained to the findings in the training set, for example, via centered stick-breaking mixtures (Yang et al., 2010). Similarly, functional Brand can be improved by adopting a more general prior specification via Gaussian Processes (Rasmussen and Williams, 2005). Lastly, it is of paramount interest to develop scalable algorithms, as Variational Bayes (Blei et al., 2017) and Expectation-Maximization (Dempster et al., 1977), for inference on massive datasets. Such solutions will offer both increased speed and lower computational cost, which are crucial for assuring the applicability of our proposal in the big data era.

Supporting Information
The Supplementary Material referenced throughout the article is available with this paper at the Statistics and Computing website. As supporting information, we report the proofs of the theoretical results showed in Section 3. Moreover, to complement the results presented in Section 6, we showcase the performance obtained by applying Brand to various challenging simulated data, varying the distributional assumptions and dimensionality. We also discuss an application to a higher dimensional real dataset, the popular benchmark Wine dataset from the UCI dataset repository, considering all its 13 features. Lastly, with the help of a controlled experiment, we guide the reader through the choice of hyperparameters and, more broadly, the whole usage of the model in the functional case. Software routines, including the implementation for both methods, the simulation study, and real data analyses of Section 6 are openly available at github. com/AndreaCappozzo/brand-public_repo. π ∼ Dir(a 0 + m 0 , a 1 + m 1 , . . . , a J + m J ).

Sample the SB variables after integrating out
5. Compute the SB weights according to (5) 6. Compute the one-line probability weightsπ according to (14). 7. Sample the atoms for the observed classes Θ j,0 exploiting conjugacy between the likelihood and the prior for j = 1, . . . , J. 8. Sample the atoms for the novel classes Θ nov 0,l exploiting conjugacy between the likelihood and the prior for l = 1, . . . , L * . 9. ObtainΘ concatenating the updated values of Θ and Θ nov . 10. Sample each ξ m from the following joint discrete distribution: 11. Recover the values for the membership vectors α and β using (15).
Divide the elements inΘ into Θ and Θ nov . end

Supplementary Material
In this Supplementary Material, we report proofs for the theoretical results reported in the main paper and some additional numerical experiments, for both the multivariate Brand and its functional extension.

Proofs
First, we derive the moments, the variance, and the covariance for the simpler case (A):p 0 = J j=0 π j δ Θ j . Then, we derive the same quantities starting from the discrete random measure that underlies Brand (B) a j (a j + 1) a(a + 1) µ j,2 + 2 j>l≥0 a j a l a(a + 1) µ j µ l .
a j a l a 2 (a + 1) µ j µ l .

Multivariate Brand -Additional experiments
The present section integrates and extends the simulation study reported in Section 6.1 of the manuscript. Particularly, the multivariate Brand is applied to a real 13-dimensional dataset (Section 10.1), and to two additional simulation studies: the former dealing with non-Gaussian shaped components (Section 10.2) and the latter encompassing higherdimensional, sparse covariance structures (Section 10.3).

Wine dataset
The dataset, publicly available in the University of California Irvine Machine Learning repository, comprises 13 chemical measurements from 178 wine samples from the Piedmont region, Italy (Forina et al., 1986). The samples arise from three different cultivars: Barolo, Grignolino, and Barbera. We randomly select 66 samples from the first two varieties to build the training set, whereas the remaining 112 samples, including 48 wines from the third cultivar, defines the test set. The multivariate Brand methodology is fitted to the datasets to detect the third unobserved wine type. The model hyperparameters are set as follows. First, h M CD = 0.95 induces robust priors elicitation for the two observed classes. Fairly uninformative priors are used for the base measure H, namely m 0 = 0, λ 0 = 0.01, ν 0 = 15 and S 0 = I 13 , where with 0 we denote the 13dimensional zero vector. Such priors agree with the ones employed in Section 6.2.1 in the main paper for the seed dataset, underlying their general applicability in scenarios where no initial information is available. After initiating the MCMC with 20, 000 iterations for the burn-in phase, 20, 000 dependent samples are retained from the target posterior distribution. The resulting classification is reported in Table 3: our semi-parametric model almost perfectly recovers the underlying partition, efficiently identifying the novel wine type. The derived accuracy is particularly high, with performance comparable to those obtained in fully-supervised experiments (Aeberhard et al., 1993).

Non-Gaussian shaped classes
In this section, we employ the same hyperparameters configuration considered for the experimental setup in Section 6.1.1 of the main paper to generate samples from elliptical distributions that differ from the Gaussian, with the final aim of validating Brand performance under model misspecification. As discussed, the general structure reported in Equation (1) allows for any distributional kernel specification. However, departures from Gaussianity would induce the loss of the conjugacy properties, worsening the computational cost. Therefore, we now want to test how robust the Gaussian kernel is in capturing symmetric components with heavier tails. To this extent, we repeat the experiment for the Label noise = False and Novelty size = Not small scenario, with resulting sample sizes being equal to for the training and test sets, respectively. In contrast to the simulations reported in the main paper, we generate groups 1 to 5 via a multivariate t distribution with 5 degrees of freedom, while classes 6 and 7 are realizations from a multivariate Laplace distribution. The resulting learning framework is displayed in Figure 11. The simulation results, for the same hyper-parameters specification employed in the main paper, are reported in Figure 12. We immediately notice that different prior settings do not substantially influence the overall model performance, with satisfactorily good results showcased for all the considered metrics. Notwithstanding, when we compare the simulation outcome in Figure 12 with the one displayed in Figure 4 of the main paper, it is apparent that the results are in some measure influenced by model misspecification.
In detail, the novelty term tends to absorb all those units sampled from the (heavy) tails of the known components, as they present very low Gaussian density, particularly when η M CD = 0.75. Consequently, the clustering induced by the DPMM shows many more groups a posteriori in this scenario, with singletons trying to accommodate patterns that the main classes cannot explain. While clearly accuracy measures for assessing the goodness of a clustering procedure shall be application-dependent (Hennig, 2015), we argue here that, in principle, the outcome showcased by our method could still be relevant in contexts that violate the Gaussianity assumption. More specifically, if the sought classes are assumed to be unimodal and elliptical, both stages in the brand methodology concur to achieve this result. Stage I trims the group-wise most outlying values, and Stage II flexibly captures all the unexplained variability with Gaussian-like shapes. Once this has been accomplished, one can employ the output evaluation in terms of trimmed units and a posteriori assignment to determine which assumptions were not met by the dataset at hand. As an example, the identification of small novel clusters in the vicinity of bigger ones may be an indication that components with heavier tails are needed to properly account for the true underlying partition. For a general overview on the difficult problem of finding groups in data and associated clustering validity measures, the interested reader is refered to Akhanli and Hennig (2020).

High-dimensional heteroscedastic classes
For this experiment, we consider the same number of classes (known and hidden ones) and sample sizes for training and test sets introduced in the previous section. Each component is distributed according to a multivariate Normal density with mean vectors equal to: and covariance matrices exhibiting different degrees of sparsity, as illustrated in Figure  13. We display the resulting learning framework in Figure 14, where the training and test samples are reported in the lower and in the upper diagonal plots, respectively. Notice that the generative mechanism induces a quite challenging novelty detection problem, as the last four dimensions are irrelevant for group separation. By again monitoring the metrics defined in the main manuscript, we aim at investigating the Brand performance when dealing with a 10-dimensional dataset with heterogeneous covariance patterns. Given the well-established efficacy of both the MCD and RMCD estimators in high-dimensional settings (Rousseeuw and Driessen, 1999;Hubert et al., 2018;Boudt et al., 2020), we focus here on the second stage of the Brand method, studying its sensitivity under different hyper-parameters specifications. In detail, a total of 8 models are fitted to B = 100 simulated datasets varying: • the concentration parameter γ of the Dirichlet Process prior, letting it be equal to 1 or 5, • the degrees of freedom ν 0 associated with the novelty components, letting it be equal to 10 or 100, • the precision parameter λ 0 associated with the novelty components, letting it be equal to 0.01 or 1.
Simulation results are reported in Figure 15. We immediately notice that the overall performance showcased by our methodology is satisfactorily good, regardless of the hyperparameters specification. Both known and novel patterns are correctly identified as such, with results mirroring the ones obtained in the bi-dimensional experiments reported in Section 6.1.2 of the main paper. In particular, eliciting very flat priors for the base measure H (λ 0 = 0.01, ν 0 = 10) still produces excellent outcomes. The only appreciable difference is in terms of ARI, induced by the DP concentration parameter γ. Specifically, increasing γ favors a priori the creation of more clusters. As a result, the model is more prone to accommodate components with a limited sample size, like the seventh group in the present experiment. On the other hand, when γ = 1 the 20 test units arisen by the last density are usually merged within some other components, producing the slight difference in ARI visible in the middle plot of Figure 15.

Functional Brand -Controlled experiment
This section investigates how functional Brand behaves according to various prior specifications and different levels of noise in the data. We aim to provide the researchers and practitioners with guidelines for the hyperparameter specification to exploit the flexibility of our model at its fullest. In the next experiments, we will consider the following six functions, evaluated on the interval t ∈ [0, 6]: f 1 (t) = 5 cos(exp(sin(t))), f 2 (t) = 3 log(sin(t 1.5 ) + 1), f 6 (t) = |t − 1| 2 sin(t). Figure 16 provides a visual representation of the six functions. Each of these functions presents its distinctive peculiarities. Simultaneously, they overlap, especially in the left half of the support, which could make the classification more difficult conditioning on the considered noise level. These six functions constitute the functional means of different groups we are going to study. We suppose that every function is observed on a discretized collection of 100 time points between 0 and 6. In all our simulation settings, we fix the number of functions sampled from each group to 50. We assume that samples from the first three groups are contained in both the training set X and in the test set Y , while samples from the last three groups represent the novelties we want to detect, and therefore they appear only in Y . We are left with 150 data objects in the training sets and 300 in the test sets. More formally, for each group in the training set ntained in both the training set X, we have x j (t) = f j (t) + j (t) with j = 1, . . . , 3, and j (t) ∼ N (0, σ 2 ) ∀t = 1, . . . , 100.
Finally, we assume three different levels of noise in the data generating process, considering σ ∈ {0.25, 0.50, 0.75}, yielding three simulated cases of training and test sets (X 1 , Y 1 ), (X 2 , Y 2 ) and (X 3 , Y 3 ,). Figure 17 shows the three generated test sets, stratified by the increasing noise level. As already mentioned, a higher level of noise induces a more evident overlap among the functional objects. We expect this overlap to make the classification more challenging.

The robust estimation of the mean function
We focus on Stage I, and we investigate the robust extraction of prior information in the functional case. In the main paper, we discuss how we smooth the training functions using B-splines bases. Similarly, for this study, we use 100 bases of order 5, which provide accurate results for a reasonable computational cost. Our functional robust estimation works as follows. We first collect the spline coefficients, and we interpret them as multivariate objects. We then apply the MCD estimator on the spline coefficients. We therefore recover, for each group, the robust mean coefficients that, once convoluted with the bases, will produce our mean functions. A crucial quantity is the percentage of extreme coefficients to remove in our robust estimation. This percentage significantly affects also the recovery of the training noiseσ 2 j (t). In the following, we demonstrate how the ratio of trimmed values η M CD affects Stage I in the functional case.
To showcase the effect of the MCD estimator, we contaminate the known classes in the training sets with time-point specific outliers and label noise. In detail, we add noise sampled from ε(t) ∼ N (0, 25) to fifteen randomly selected functions at fifteen random timestamps. We contaminate five functions from the first group, four in the second, and six in the third. Then, we shuffle 10% of the functions across groups to generate label noise. To illustrate this step, we report in Figure 18 a visual depiction of the contaminated version of the training dataset X 3 , stratified by class. Figure 18: Functional objects contained in X 3 -generated specifying σ = 0.75. Each known class is characterized by label noise and time-specific outliers (highlighted by red dots).
We consider three different specifications of η M CD : • no trimming by setting η M CD = 1: all the estimated spline coefficients are considered; • mild trimming, η M CD = 0.95: only 5% of the estimated spline coefficients are removed for the estimation of the group centroids and functional noise; • strong trimming. η M CD = 0.75: a fourth of the estimated spline coefficients is removed before computing the group centroids and functional noise.
We report the extracted mean functions and the estimated noise in the panels of Figure 19. Specifically, the group centroids are depicted in red, superimposed onto the observations of every group. We also plot in blue the intervals of variation computed as f j (t) ± q ·σ 2 j (t) to give an idea of the estimated noise, for j, q = 1, 2, 3. Each column of the plot shows the evolution of the mean and noise functions when considering different trimming levels, stratified by group. We can observe how employing trimming benefits the estimation of the representative functions in each group. For example, in the first row of Figure 19, we see how the red line fails to represent the group characteristics. This behavior is exacerbated close to the boundaries in all the groups. It is evident how even a small percentage of trimming yields considerable benefits. In our experience, we obtain very satisfactory results in the functional case when considering a strong trimming. Results in a case of extreme trimming, obtained by setting η M CD = 0.5 were explored but not reported, since similar to the case η M CD = 0.75. In conclusion, if enough data are available, we suggest considering strong trimming as the safest option when dealing with the functional case. In this way, we can eradicate the influence of potential outliers on both the mean and the variance. We will use the mean and noise functions recovered with η M CD = 0.75 for the subsequent analyses.

Model performance according to different prior specifications
We can summarise the hyperparameters specific of functional Brand in three main subsets: • Subset 1: hyperparameters controlling the dispersion around the extracted mean functions of the known groups: ϕ j and ν j ; • Subset 2: hyperparameters controlling the dispersion around the novel spline coefficients, representing the unknown mean functions: a τ , b τ , and s 2 ; • Subset 3: hyperparameters controlling the dispersion of the noise around the novel mean functions: a H and b H .
We investigate the sensitivity of functional Brand to various hyperparameter specifications while considering the different noise levels in the observed data, as previously discussed. As a first consideration, we point out that setting the hyperparameters controlling the variance around the known classes and/or the novel ones requires the most care. A poorly elicited prior for the error terms could lead to losing Brand hierarchical structure, letting the novelty components take over the known groups in lieu of their flexibility. The opposite issue may also arise, since larger mixture weights are assigned to the known components. In the following, we will provide some considerations regarding the tuning of the hyperparameters listed in Subset 1 and Subset 2. For the sake of brevity, we omit the results obtained tweaking the parameters that belong to Subset 3, which control the noise level in the novelty term σ 2 nov h (t), ∀h. Given the insights collected in several applications, we suggest eliciting an Inverse Gamma prior that favors low noise values, promoting more sensitive estimation of the random functions f nov h (t), ∀h. Thus, we fix a H = 5 and b H = 1 in the following, as well as in the functional application in the main paper. Let us first discuss the hyperparameter specification that we adopted in the main text. We are going to assume these values as our default setting. The values are ϕ j ≈ 0, ν j ≈ 0, a τ = 3, b τ = 1, s 2 = 1 a H = 5, b H = 1.
First, let us describe and justify our choice of hyperparamters. The parameters in Subset 1 are set to negligible values. In this way, we are implicitly enforcing an inductive learning approach: the information extracted from the training set for the known groups is trustworthy and need not be updated. We calibrate the parameters in Subset 2 to induce a reasonably tight prior over the novel splines coefficients to help the model identifiability. One can claim that an uninformative specification would also be reasonable to reflect our ignorance about the novel groups. However, as we will show, letting the parameters to vary too freely may negatively affect the results. Finally, as already discussed, the informative specification of the parameters in Subset 3 bounds the variance of the errors around the novel functions: this will avoid limiting cases where all the information contained in the data is explained by the noise term, helping the identification of the latent mean functions.
In the following, we tweak one subset of parameters at a time to investigate its effect, starting from this prior configuration. In all the subsequent experiments, we run 7,500 MCMC iterations after a burn-in period of the same length.

Tuning hyperparameters: Subset 1
For this first experiment, we consider the three hyperprior configurations devised to increase the uncertainty around the functional means and variance extracted from the training set. In other words, we want to understand how, ceteris paribus, the concentration of the priors for f j (t) and σ 2 j (t), j = 1, . . . , J affects the estimation of the mean functions of the known groups. As already mentioned, a proper elicitation of the prior noises σ 2 j (t) for the known classes is crucial for carrying out sensible inference. Throughout many simulation studies, we observed that σ 2 j (t) plays a fundamental role in preserving the specific hierarchy that characterizes Brand. In fact, in scenarios where the prior variance of the measurement errors around the training mean is too wide, the known components lose their "priority" over the novel ones. In other words, the robust estimates extracted from the training set are deemed as less representative of the truth. Simultaneously, the flexibility of the novelty terms leaves the new estimated functions free to adapt to the dataset, ending up absorbing the known classes. In short, if the variances of the known components are not carefully tuned, Brand loses the ability to distinguish between known groups and novelties, hindering the resulting classification. To provide an idea on how different choices of ϕ j , and ν j affect the results, we consider: HPC 1 -Inductive setting: ϕ j ≈ 0, ν j ≈ 0, HPC 2 -Informative specification: ϕ j = 0.01, ν j = 1/100000, 48 HPC 3 -"Uninformative" specification: ϕ j = 1, ν j = 1/100. The variance around the mean function is set to 1.
For each configuration, we estimate functional Brand on the three different test sets Y 1 , Y 2 , and Y 3 . We show the confusion matrices displaying the assignation of the functional objects contained in the test set into known classes or novelties in Table 4. Moreover, to better exemplify the model's behavior, we display the functional estimates of the known mean functions for Y 3 in Figure 20. For each known group, we collect the robustly extracted mean functions (in red), the simulated posterior mean functions (computed pointwise, in blue) on top of the MCMC iterations (in black).
We can detect a recurring pattern across different datasets. On the one hand, the first and second prior configurations (HPC 1 and HPC 2) lead to perfect classification of the functional data into known components (Classes 1, 2, and 3 into Cluster 1, 2, and 3, respectively) and novelties (Classes 4, 5, and 6 into Cluster 0) regardless the level of noise in the dataset. Looking at the different panels of Figure 20, we see how HPC 2 allows more uncertainty in the mean estimates (represented by a wider range of variation of the MCMC simulations) and a difference between the starting robust means (in red) and the MCMC posterior means (in blue). As expected, in the inductive setting, the red and blue lines coincide. On the other hand, we can see what happens when tight constraints are not placed on the variances σ 2 j (t) by focusing on the last block of columns in Table 4 and the last row of panels in Figure 20. In all the three cases Y 1 , Y 2 , and Y 3 we see how the novelty component takes over and ruins the classification process. For example, in the bottom row of Figure 20 we see that in the three panels the estimated posterior mean function coincides with the prior since no data have been assigned to those components. In the first two datasets, we see how the higher variances lead Brand to mislead a novel group as a known one and viceversa. We can conclude that, for the functional setting, care is needed when tuning the discussed hyperprior parameters. Thus, when enough data are available, we suggest either to rely on the inductive learning process or, otherwise, to induce small a priori variations.    Figure 21: Summary of the resulting classification given by the model considered in Section 6.3. Each functional object is colored according to its correct data type, while each panel contains the meat spectra that Brand classifies together.  Figure 23: Functional objects classified as Novelties, colored according to their assigned cluster (Beef, in blue and Outliers, in gray). We highlighted in red the outlying functional data that is assigned to the Beef cluster.