Learning Extremal Representations with Deep Archetypal Analysis

Archetypes are typical population representatives in an extremal sense, where typicality is understood as the most extreme manifestation of a trait or feature. In linear feature space, archetypes approximate the data convex hull, allowing all data points to be expressed as convex mixtures of archetypes. However, it might not always be possible to identify meaningful archetypes in a given feature space. Learning an appropriate feature space and identifying suitable archetypes simultaneously addresses this problem. This paper introduces a generative formulation of the linear archetype model, parameterized by neural networks. By introducing the distance-dependent archetype loss, the linear archetype model can be integrated into the latent space of a variational autoencoder, and an optimal representation with respect to the unknown archetypes can be learned end-to-end. The reformulation of linear Archetypal Analysis as a deep variational information bottleneck allows the incorporation of arbitrarily complex side information during training. Furthermore, an alternative prior, based on a modified Dirichlet distribution, is proposed. The real-world applicability of the proposed method is demonstrated by exploring archetypes of female facial expressions while using multi-rater based emotion scores of these expressions as side information. A second application illustrates the exploration of the chemical space of small organic molecules. In this experiment, it is demonstrated that exchanging the side information while keeping the same set of molecules, e.g. using the heat capacity of each molecule as side information instead of the band gap energy, results in the identification of different archetypes. As an application, these learned representations of chemical space might reveal distinct starting points for de novo molecular design.


Introduction
Colloquially, both the words "archetype" and "prototype" describe templates or original patterns from which all later forms are developed. However, the concept of a prototype is more common in machine learning and is encountered, for example, as cluster centroids in classification, where a query point x is assigned to the class of the closest prototype. In an appropriate feature space such a prototype is a typical representative of its class, sharing all traits of the class members, ideally in equal proportion. By contrast, archetypes are characterized as being extreme points of the data, such that the complete data set can be well represented as a convex mixture of these extremes or archetypes. Archetypes thus form a polytope approximating the data convex hull.
Based on the historic Iris flower data set (Anderson, 1935; Fisher, 1936), Figure 1 illustrates the different perspectives both approaches provide in exploring the data. In Figure 1a the cluster means as well as the decision boundaries in a 2-dimensional feature space are shown. The clustering was calculated using the k-means algorithm. Each cluster mean is a typical average representative of its respective class, the aforementioned prototype. According to this clustering, the prototypical Iris virginica has a sepal width of 3.1 cm and a sepal length of 6.8 cm. On the other hand, Figure 1b shows the positions of the three archetypal Iris flowers, which are typical extreme representatives. The archetypal Iris virginica has a sepal width of 3.0 cm and a sepal length of 7.8 cm. All flowers within the simplex are characterized as weighted mixtures of these archetypes while, in terms of convex mixtures, the optimal representations of flowers outside the simplex are their normal projections onto its surface. In general, a clustering approach is more natural if the existence of a cluster structure can be presumed. Otherwise, archetypal analysis might offer an interesting perspective for exploratory data analysis.

Exploring Data Sets Through Archetypes
Archetypal analysis (AA) was first proposed by Cutler and Breiman (1994). It is a linear procedure where archetypes are selected by minimizing the squared error in representing each individual data point as a mixture of archetypes. Identifying the archetypes involves the minimization of a non-linear least squares loss.

Archetypal Analysis
Linear AA is a form of non-negative matrix factorization where a matrix $X \in \mathbb{R}^{n \times p}$ of n data vectors is approximated as $X \approx ABX = AZ$ with $A \in \mathbb{R}^{n \times k}$, $B \in \mathbb{R}^{k \times n}$, and usually $k < \min\{n, p\}$. The so-called archetype matrix $Z \in \mathbb{R}^{k \times p}$ contains the k archetypes $z_1, \dots, z_j, \dots, z_k$ with the model being subject to the following constraints:

$$a_{ij} \geq 0 \;\wedge\; \sum_{j=1}^{k} a_{ij} = 1, \qquad b_{ji} \geq 0 \;\wedge\; \sum_{i=1}^{n} b_{ji} = 1. \tag{1}$$

Constraining the entries of A and B to be non-negative and demanding that both weight matrices are row stochastic implies a representation of the data vectors $x_{i=1..n}$ as a weighted sum of the rows of Z while simultaneously representing the archetypes $z_{j=1..k}$ themselves as a weighted sum of the n data vectors in X:

$$x_i \approx \sum_{j=1}^{k} a_{ij} z_j = a_i Z, \qquad z_j = \sum_{i=1}^{n} b_{ji} x_i = b_j X. \tag{2}$$

Due to the constraints on A and B in Eq. 1, both the representation of $x_i$ and of $z_j$ in Eq. 2 are convex combinations. Therefore the archetypes approximate the data convex hull, and increasing the number k of archetypes improves this approximation. The central problem of AA is finding the weight matrices A and B for a given data matrix X and a given number k of archetypes. The non-linear optimization problem consists in minimizing the following residual sum of squares:

$$\mathrm{RSS}(k) = \lVert X - AZ \rVert^2 = \lVert X - ABX \rVert^2 \quad \text{with} \quad Z = BX. \tag{3}$$

A probabilistic formulation of linear AA is provided by Seth and Eugster (2016), where it is observed that AA follows a simplex latent variable model and a normal observation model. The generative process for the observations $x_i$ in the presence of k archetypes with archetype weights $a_i$ is given by

$$a_i \sim \mathrm{Dir}_k(\alpha) \;\wedge\; x_i \sim \mathcal{N}(a_i Z, \epsilon^2 I), \tag{4}$$

with uniform concentration parameters $\alpha_j = \alpha$ for all j, and weights summing up to $\lVert a_i \rVert_1 = 1$. That is, the observations $x_i$ are distributed according to isotropic Gaussians with means $\mu_i = a_i Z$ and variance $\epsilon^2$.

Fig. 1: Result of a clustering procedure as well as an archetypal analysis, performed on the Iris data set. For clustering, the k-means algorithm was used, which is an unsupervised clustering algorithm identifying the average representatives of a data set, i.e. the cluster centroids. Archetypal Analysis, on the other hand, seeks to identify extremes in the data set with the goal to represent individual data points as weighted mixtures of these extreme points, the so-called archetypes.
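The factorization X ≈ ABX with row-stochastic weight matrices can be made concrete with a small numerical sketch. The code below is illustrative only: the helper names (`softmax_rows`, `archetypal_rss`), the random data, and the use of a softmax to obtain row-stochastic matrices are assumptions made for this example, and it evaluates the residual sum of squares for given weights rather than reproducing the optimization procedure of Cutler and Breiman (1994).

```python
import numpy as np

def softmax_rows(M):
    """Row-wise softmax: every row becomes non-negative and sums to 1."""
    M = M - M.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    E = np.exp(M)
    return E / E.sum(axis=1, keepdims=True)

def archetypal_rss(X, A, B):
    """Return ||X - A B X||^2 together with the archetype matrix Z = B X."""
    Z = B @ X                      # archetypes as convex combinations of data points
    residual = X - A @ Z           # data points approximated as convex mixtures of archetypes
    return float(np.sum(residual ** 2)), Z

# Hypothetical toy problem: n data points in p dimensions, k archetypes.
rng = np.random.default_rng(0)
n, p, k = 50, 4, 3
X = rng.normal(size=(n, p))
A = softmax_rows(rng.normal(size=(n, k)))  # shape (n, k), row stochastic
B = softmax_rows(rng.normal(size=(k, n)))  # shape (k, n), row stochastic
rss, Z = archetypal_rss(X, A, B)
```

Because every archetype is a convex combination of data points, each row of `Z` necessarily lies inside the bounding box of the data, which gives a quick sanity check on the constraints.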

A Biological Motivation for Archetypal Analysis
Conceptually, the motivation for Archetypal Analysis is purely statistical, but the method itself always implied the possibility of interpretations with a more evolutionary flavour. By representing an individual data point as a mixture of pure types or archetypes, a natural link to the evolutionary development of biological systems is implicitly established. The publication by Shoval et al. (2012) entitled 'Evolutionary Trade-Offs, Pareto Optimality, and the Geometry of Phenotype Space' made this connection explicit, providing a theoretical foundation of the 'archetype concept'. In general, evolutionary processes are multi-objective optimization problems and as such subject to unavoidable trade-offs: If multiple tasks need to be performed, no (biological) system can be optimal at all tasks at once. Examples of such trade-offs include those between longevity and fecundity in Drosophila melanogaster, where long-lived flies show decreased fecundity (Djawdan et al., 1996), or predators that evolve to be fast runners but eventually have to trade off their ability to subdue large or strong prey, e.g. cheetah versus lion (Garland, 2014). Such evolutionary trade-offs are known to affect the range of phenotypes found in nature (Tendler et al., 2015).
In Shoval et al. (2012) it is argued that best-trade-off phenotypes are weighted averages of archetypes while archetypes themselves are phenotypes specialized at performing a single task optimally. An example of an evolutionary trade-off in the space of traits (or phenospace) for different species of bats (Microchiroptera) is shown in Figure 2. Based on a study of bat wings by Norberg et al. (1987), each species is represented in a two-dimensional space where the axes depict Body Mass and Wing Aspect Ratio. The latter is the square of the wingspan divided by the wing area. Table 1 gives [...] (Steuer, 1986), which was recently used in biology to study trade-offs in evolution (Schuetz et al., 2012; El Samad et al., 2005). All phenotypes that have evolved over time lie within a restricted part of the phenospace, the so-called Pareto front, which is the set of phenotypes that cannot be improved at all tasks simultaneously. If there were a phenotype that is better at all tasks than a second phenotype, then the latter would be eliminated over time by natural selection. Consequently, phenotypes on the Pareto front are the best possible compromise between the different requirements or tasks.

Related Work
Linear "Archetypal Analysis" (AA) was first proposed by Cutler and Breiman (1994). Since its conception, AA has seen several advancements on the algorithmic side: In Stone and Cutler (1996) the authors propose an archetype model able to identify archetypes in space and time, named "archetypal analysis of spatiotemporal dynamics". A similar problem is addressed in "Moving archetypes" by Cutler and Stone (1997).
[...] 2. From an evolutionary perspective, the phenotype is a consequence of the specialization, for details see (Shoval et al., 2012).
Model selection is the topic of Prabhakaran et al. (2012), where the authors are concerned with the optimal number of archetypes needed to characterize a given data set. An extension of the original archetypal analysis model to non-linear kernel archetypal analysis is proposed by Bauckhage and Manshaei (2014) and Mørup and Hansen (2012). In Kaufmann et al. (2015), the authors use a copula-based approach to make AA independent of strictly monotone transformations of the input data. The reasoning is that such transformations should in general not influence which points are identified as archetypes. Algorithmic improvements by adapting a Frank-Wolfe type algorithm to speed up the calculation of archetypes are made by Bauckhage et al. (2015).
A probabilistic version of archetypal analysis was introduced by Seth and Eugster (2016), lifting the restriction of archetypal analysis to real-valued data and instead allowing other observation types such as integers, binary, and probability vectors as input. Efficient "coresets for archetypal analysis" are proposed by Mair and Brefeld (2019) in order to reduce the high computational cost due to the additional convexity-preserving constraints when identifying archetypes.
Although AA did not prevail as a commodity tool for pattern analysis, several applications have used it very successfully. In Chan et al. (2003), AA is used to analyse galaxy spectra which are viewed as weighted superpositions of the emissions from stellar populations, nebular emissions and nuclear activity. For the human genotype data studied by Huggins et al. (2007), inferred archetypes are interpreted as representative populations for the measured genotypes. In computer vision, AA has for example been used by Bauckhage and Thurau (2009) to find archetypal images in large image collections or by Canhasi and Kononenko (2015) to perform the analogous task for large document collections. In combination with deep learning, archetypal style analysis (Wynen et al., 2018) applies AA to learned image representations in order to realize artistic style manipulations.
Our work is based on the variational autoencoder model (VAE), arguably one of the most popular representatives of the class of "Deep Latent Variable Models". VAEs were introduced by Kingma and Welling (2013) and Rezende et al. (2014) and use an inference network to perform a variational approximation of the posterior distribution of the latent variables. Important work in this direction includes Kingma et al. (2014), Rezende and Mohamed (2015) and Jang et al. (2017). More recently, Alemi et al. (2016) have discovered a close connection between VAE models and the Information Bottleneck principle (Tishby et al., 2000). Here, the Deep Variational Information Bottleneck (DVIB) is a VAE where not the input X is reconstructed (i.e. decoded) but rather a datum Y, about which X is known to contain information. Subsequently, the DVIB has been extended in multiple directions such as sparsity (Wieczorek et al., 2018) or causality (Parbhoo et al., 2018). Akin to our work, AAnet is a model proposed by van Dijk et al. (2019) as an extension of linear archetypal analysis on the basis of standard, i.e. non-variational, autoencoders. In their work two regularization terms, applied to an intermediate representation, provide the latent archetypal convex representation of a non-linear transformation of the input. In contrast to our work, which is based on probabilistic generative models (VAE, DVIB), AAnet attempts to emulate the generative process by adding noise to the latent representation during training. Further, no side information is incorporated, which can (and in our opinion should) be used to constrain potentially over-flexible neural networks and guide the optimisation process towards learning a meaningful representation.

Present Work
Archetypal analysis, as proposed by Cutler and Breiman (1994), is a linear method and cannot integrate any additional information about the data, e.g. labels, that might be available. Furthermore, the feature space in which AA is performed is spanned by features that had to be selected by the user based on prior knowledge. In the present work an extension of the original model is proposed such that appropriate representations can be learned end-to-end, side information can be incorporated to help learn these representations, and non-linear relationships between features can be accounted for.

Deep Variational Information Bottleneck
We propose a model to generalise linear AA to the non-linear case based on the Deep Variational Information Bottleneck framework, since it allows side information Y to be incorporated by design and is known to be equivalent to the VAE in the case of Y = X, as shown in Alemi et al. (2016). In contrast to the data matrix X in linear AA, a non-linear transformation f(X) giving rise to a latent representation T of the data suitable for (non-linear) archetypal analysis is considered. That is, the latent representation T takes the role of the data X in the previous treatment. The DVIB combines the information bottleneck (IB) with the VAE approach (Tishby et al., 2000; Kingma and Welling, 2013). The objective of the IB method is to find a random variable T which, while compressing a given random vector X, preserves as much information as possible about a second given random vector Y. The objective function of the IB is as follows

$$\min_{p(t|x)} I(X; T) - \lambda I(T; Y), \tag{5}$$

where λ is a Lagrange multiplier and I denotes the mutual information. Assuming the IB Markov chain T − X − Y and a parametric form of Eq. 5 with parametric conditionals $p_\phi(t|x)$ and $p_\theta(y|t)$, Eq. 5 is written as

$$\max_{\phi,\theta} -I_\phi(t; x) + \lambda I_{\phi,\theta}(t; y). \tag{6}$$

As derived in Wieczorek et al. (2018), the two terms in Eq. 6 have the following forms:

$$I_\phi(t; x) = \mathbb{E}_{p(x)} D_{KL}\big(p_\phi(t|x) \,\|\, p(t)\big) \tag{7}$$

and

$$I_{\phi,\theta}(t; y) = \mathbb{E}_{p(x,y)} \mathbb{E}_{p_\phi(t|x)} \log p_\theta(y|t) + h(Y). \tag{8}$$

Here $h(Y) = -\mathbb{E}_{p(y)} \log p(y)$ denotes the entropy of Y in the discrete case or the differential entropy in the continuous case. The models in Eq. 7 and Eq. 8 can be viewed as the encoder and decoder, respectively. Assuming a standard prior of the form $p(t) = \mathcal{N}(t; 0, I)$ and a Gaussian distribution for the posterior $p_\phi(t|x)$, the KL divergence in Eq. 7 becomes a KL divergence between two Gaussian distributions which can be expressed in analytical form as in Kingma and Welling (2013). $I_\phi(t; x)$ can then be estimated on mini-batches of size m as

$$I_\phi(t; x) \approx \frac{1}{m} \sum_{i=1}^{m} D_{KL}\big(p_\phi(t|x_i) \,\|\, p(t)\big). \tag{9}$$

As for the decoder, $\mathbb{E}_{p(x,y)} \mathbb{E}_{p_\phi(t|x)} \log p_\theta(y|t)$ in Eq. 8 is estimated using the reparametrisation trick proposed by Kingma and Welling (2013) and Rezende et al. (2014):

$$\mathbb{E}_{p(x,y)} \mathbb{E}_{p_\phi(t|x)} \log p_\theta(y|t) \approx \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \log p_\theta(y_i | t_i) \tag{10}$$

with the reparametrisation

$$t_i = \mu_i(x_i) + \mathrm{diag}\big(\sigma_i(x_i)\big)\, \varepsilon. \tag{11}$$

As mentioned earlier, in the case of Y = X the original VAE is retrieved (Alemi et al., 2016). In our applications, we would like to predict not only the side information Y but also reconstruct the input X. Similar to the approach proposed in Gomez-Bombarelli et al. (2018), we use an additional decoder branch to predict the reconstruction $\tilde{X}$. This extension requires an additional term $I_{\phi,\psi}(t; \tilde{x})$ in the objective function Eq. 6 and an additional Lagrange multiplier ν. The mutual information estimate $I_{\phi,\psi}(t; \tilde{x})$ is obtained analogously to Eq. 10.
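The two estimators described in this section, the analytical KL divergence between a diagonal Gaussian posterior and the standard normal prior and the reparametrised sampling of the latent code, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the encoder outputs `mu` and `log_var` are replaced by random placeholders, the log-variance parameterisation and the function names are choices made for this example, and no decoder is included.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def reparametrise(mu, log_var, rng):
    """Sample t = mu + sigma * eps with eps ~ N(0, I), keeping t differentiable in mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Placeholder encoder outputs for a mini-batch of m samples and a d-dimensional latent space.
rng = np.random.default_rng(0)
m, d = 8, 2
mu = rng.normal(size=(m, d))
log_var = rng.normal(scale=0.1, size=(m, d))

kl_estimate = gaussian_kl_to_standard_normal(mu, log_var).mean()  # mini-batch average of the KL term
t = reparametrise(mu, log_var, rng)                               # latent samples passed on to a decoder
```

In an actual model the sampled `t` would be fed to the decoder branches, and the average log-likelihood of the side information under the decoder would supply the second term of the objective.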

Deep Archetypal Analysis
Deep Archetypal Analysis can then be formulated in the following way. For the sampling of $t_i$ in Eq. 10 the probabilistic AA approach as in Eq. 4 can be used, which leads to

$$t_i \sim \mathcal{N}\big(\mu_i = a_i(x_i) Z, \; \sigma_i^2(x_i) I\big), \tag{12}$$

where the mean $\mu_i$, given through $a_i$, and the variance $\sigma_i^2$ are non-linear transformations of the data point $x_i$ learned by the encoder. We note that the means $\mu_i$ are convex combinations of the archetypes $z_{j=1..k}$ with weight vectors $a_i$, and the archetypes in turn are considered to be convex combinations of the means $\mu_{i=1..m}$ with weight vectors $b_j$.¹ By learning weight matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times m}$ which are subject to the constraints formulated in Eq. 1 and parameterised by φ, a non-linear transformation of the data X is learned which drives the structure of the latent space to form archetypes whose convex combinations yield the transformed data points. A major difference to linear AA is that for Deep AA we cannot identify the positions of the archetypes $z_j$, as there is no absolute frame of reference in latent space. We thus position k archetypes at the vertex points of a (k − 1)-simplex and collect these fixed coordinates in the matrix $Z_{fixed}$. These requirements lead to an additional distance-dependent archetype loss of

$$\ell_{AT} = \lVert Z_{fixed} - B A Z_{fixed} \rVert_2^2 = \lVert Z_{fixed} - Z_{pred} \rVert_2^2, \tag{13}$$

where $Z_{pred} = B A Z_{fixed}$ are the predicted archetype positions given the learned weight matrices A and B. For $Z_{pred} \approx Z_{fixed}$ the loss function $\ell_{AT}$ is minimized and the desired archetypal structure is achieved. The objective function of Deep AA is then given by

$$\max_{\phi,\theta} -I_\phi(t; x) + \lambda I_{\phi,\theta}(t; y) + \nu I_{\phi,\psi}(t; \tilde{x}) - \ell_{AT}. \tag{14}$$

A visual illustration of Deep AA is given in Figure 3. The constraints on A and B can be guaranteed by using softmax layers, and Deep AA can be trained with a standard stochastic gradient descent technique such as Adam (Kingma and Ba, 2014). Note that the model can naturally be relaxed to the VAE setting by omitting the side information term $\lambda I_{\phi,\theta}(t; y)$ in Eq. 14.
Fig. 3: Illustration of the Deep AA model. Encoder side: Learning weight matrices A and B allows to compute the archetype loss $\ell_{AT}$ in Eq. 13 and sample latent variables t as described in Eq. 12. The constraints on A and B in Eq. 1 are enforced by using softmax layers. Decoder side: $Z_{fixed}$ represents the fixed archetype positions in latent space while $Z_{pred}$ is given by the convex hull of the transformed data point means µ during training. Minimizing $\ell_{AT}$ corresponds to minimizing the red-dashed (pairwise) distances. The input is reconstructed from the latent variable t. In the presence of side information, the latent representation allows to reproduce the side information Y as well as the input X.
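The role of the archetype loss can be illustrated with a small NumPy sketch. Several things here are assumptions made for illustration and not taken from the paper: the loss is written as a squared Frobenius distance between the fixed and predicted archetype positions, the fixed archetypes are placed at the standard basis vectors of R^k (one possible choice of simplex vertices), and the encoder outputs are replaced by random logits passed through a row-wise softmax.

```python
import numpy as np

def softmax_rows(M):
    """Row-wise softmax, so that every row is non-negative and sums to 1."""
    M = M - M.max(axis=1, keepdims=True)
    E = np.exp(M)
    return E / E.sum(axis=1, keepdims=True)

def archetype_loss(A, B, Z_fixed):
    """Squared distance between fixed archetypes and predicted archetypes Z_pred = B A Z_fixed."""
    Z_pred = B @ A @ Z_fixed
    return float(np.sum((Z_fixed - Z_pred) ** 2)), Z_pred

k, m = 3, 6                      # number of archetypes, mini-batch size
Z_fixed = np.eye(k)              # assumption: vertices of the standard (k-1)-simplex in R^k

# Random stand-ins for the softmax outputs of an encoder.
rng = np.random.default_rng(0)
A = softmax_rows(rng.normal(size=(m, k)))
B = softmax_rows(rng.normal(size=(k, m)))
loss, Z_pred = archetype_loss(A, B, Z_fixed)
mu = A @ Z_fixed                 # latent means as convex combinations of the fixed archetypes

# Idealised case: the first k points of the batch sit exactly on the archetypes.
A_ideal = np.vstack([np.eye(k), np.full((m - k, k), 1.0 / k)])
B_ideal = np.hstack([np.eye(k), np.zeros((k, m - k))])
ideal_loss, _ = archetype_loss(A_ideal, B_ideal, Z_fixed)
```

In the idealised case the product of `B_ideal` and `A_ideal` is the identity, so the predicted archetypes coincide with the fixed ones and the loss vanishes; for random weights the loss is strictly positive, which is what training is meant to reduce.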

The Necessity for Side Information
The goal of Deep AA is to identify meaningful archetypes in latent space which will subsequently allow the informed exploration of the given data set. The "meaning" of an archetype, and thereby the associated interpretation, can be improved by providing so-called side information, i.e. information in addition to the input data. If the input datum is for example an image, additional information could simply be a scalar- or vector-valued label. Using richer side information, e.g. additional images, is possible, too. The fundamental idea is that information about what constitutes a typical representative (in the archetypal sense) might not be information that is readily present in the input X but dependent on, or even defined by, the side information. Taking a data set of car images as an example, what would be an archetypal car? Certainly, the overall size of a car would be a good candidate, such that smaller sports cars and larger pick-ups might be identified as archetypes. But introducing the fuel consumption of each car as side information would put sports cars and pick-ups closer together in latent space, as both car types often consume above-average quantities of fuel. In this way, side information guides the learning of a latent representation which is informative with respect to exactly the side information provided. Consequently, typicality is not a characteristic of the data alone, but a function of the provided side information. The selection of appropriate side information can therefore only be linked to the question the user of a deep AA model tries to answer.

Archetypal Analysis: Dealing With Non-linearity
Data generation. For this experiment, data $X \in \mathbb{R}^{n \times 8}$ is generated that is a convex mixture of k archetypes $Z \in \mathbb{R}^{k \times 8}$ with $k \ll n$. The generative process for the datum $x_i$ follows Eq. 4, where $a_i$ is a stochastic weight vector denoting the fraction of each of the k archetypes $z_j$ needed to represent the data point $x_i$. A total of n = 10000 data points is generated, of which k = 3 are true archetypes. The variance is set to $\sigma^2 = 0.05$ and the linear 3-dim data manifold is embedded in an 8-dimensional space. Note that although linear and deep archetypal analysis is always performed on the full data set, only a fraction of that data is displayed when visualizing results.
Linear AA - non-linear data. Data is generated as described above and an additional non-linearity is introduced by applying an exponential to one dimension of X, which results in a curved 8-dimensional data manifold. Linear archetypal analysis is then performed using the efficient Frank-Wolfe procedure proposed by Bauckhage et al. (2015). For visualization, PCA is used to recover the original 3-dimensional data submanifold which is embedded in the 8-dimensional space. The first three principal components of the ground truth data are shown in Figure 4a as well as the computed archetypes (connected by dashed lines). The computed archetypes occupy optimal positions according to the optimization problem in Eq. 3, but due to the non-linearity in the data it is impossible to recover the three ground truth archetypes.
Deep AA - non-linear data. For data that has been generated as described in the previous paragraph, a strictly monotone transformation in the form of an exponentiation should in general not change which data points are identified as archetypes. But this is clearly the case for linear AA, as it is unable to recover the true archetypes after a non-linearity has been applied. Using that same data to train the deep AA architecture presented in Figure 3 generates the latent space structure shown in Figure 4b, where the three archetypes A, B and C have been assigned to the appropriate vertices of the latent simplex. Moreover, the sequence of color stripes shown has been correctly mapped into the latent space. Within the latent space, data points are again described as convex linear combinations of the latent archetypes. Latent data points can also be reconstructed in the original data space through the learned decoder network. The network architecture used for this experiment was a simple feedforward network (two-layer encoder and decoder), trained for 20 epochs with a batch size of 100 and a learning rate of 0.001.
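The data generation described above can be sketched as follows. The number of points, the number of archetypes, the dimensionality and the noise variance are taken from the text; the archetype coordinates, the Dirichlet concentration of 1, the random seed and the choice of which dimension receives the exponential are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 10000, 3, 8        # data points, archetypes, ambient dimension (from the text)
sigma2 = 0.05                # noise variance (from the text)

# Assumption: archetypes drawn uniformly at random; the paper does not state their coordinates.
Z = rng.uniform(-1.0, 1.0, size=(k, p))

# Convex mixture weights on the simplex (assumed uniform Dirichlet, alpha = 1).
a = rng.dirichlet(np.ones(k), size=n)

# Isotropic Gaussian observations around the convex mixtures a_i Z.
X = a @ Z + rng.normal(scale=np.sqrt(sigma2), size=(n, p))

# Non-linear variant: apply an exponential to one dimension (here, arbitrarily, the first).
X_nonlinear = X.copy()
X_nonlinear[:, 0] = np.exp(X_nonlinear[:, 0])
```

Because the exponential is strictly monotone, the ordering of points along the transformed dimension is preserved even though the manifold becomes curved, which is the property the comparison between linear and deep AA relies on.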

Archetypes in Image-based Sentiment Analysis
The Japanese Female Facial Expression (JAFFE) database was introduced by Lyons et al. (1998) and contains 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral). The expressions are happiness, sadness, surprise, anger, disgust and fear. All expressions were posed by 10 Japanese female models. Each image has been rated on 6 emotion adjectives by 60 Japanese subjects on a 5-level scale (5-high, 1-low) and each image was then assigned a 6-dim. vector of average ratings. For the following experiments the advice of the creator of the JAFFE data set was followed to exclude fear images and the fear adjective from the ratings, as the models were not believed to be good at posing fear. All experiments based on the JAFFE data set are performed on the following architecture²:
Encoder: [...]
Decoder (Side Information Branch): Input: latent code t → FC200-5 → side information ỹ
ReLU activations are used in-between layers and sigmoid activations for the image intensities. The different losses are weighted as follows: we multiplied the archetype loss by a factor of 80, the side information loss by 560, and the KL divergence by 40. In the setting where only two labels are considered, the weight for the archetype loss is increased to 120. The network was trained for 5000 epochs with a mini-batch size of 50 and a learning rate of 0.0001. For training an NVIDIA TITAN X Pascal GPU was used, where a full training session lasted approximately 30 minutes.

JAFFE: Latent Space Structure
Emotions conveyed through facial expressions are a suitable case to demonstrate the interpretability of the learned latent representation in deep AA. First, the existence of archetypes is plausible as there clearly are expressions that convey a maximum of a given emotion, i.e. a person can look extremely/maximally surprised. Second, facial expressions change continuously without having a clearly defined cluster structure. Moreover, these expressions lend themselves to being interpreted as mixtures of basic (or archetypal) emotional expressions, a perspective also reinforced by the averaged ratings for each image, which are essentially weight vectors with respect to the archetypal emotional expressions. Figure 5a shows the learned archetypes "happiness", "anger" and "surprise" while expressions linked to the emotion adjective "sadness" are identified as mixtures between archetype 1 (happiness) and archetype 2 (anger). Figure 5b shows the positions of the latent means where the color coding is based on the argmax of the emotion rating, which is a 5-dimensional vector. An analogous situation is found in the case of "disgust", which, according to deep AA, is a mixture between archetype 2 (anger) and archetype 3 (surprise). Towards the center of the simplex, expressions are located which share equal weights with respect to the archetypes and thus resemble a more "neutral" facial expression, as shown in Figure 8.
Side Information for JAFFE. The JAFFE data set contains facial expressions posed by 10 Japanese female models. Based solely on the visual information, i.e. disregarding the emotion scores, these images could meaningfully be grouped together in a variety of ways, e.g. by head shape, hair style, or identity of the model posing the expressions. Interpreting resulting archetypes in general requires guiding information that tells the model which "definition of typicality" it is required to learn. While it is obvious to learn typical emotion expressions in the case of JAFFE, most applications are arguably more ambiguous. In section 5.3 a chemical experiment is discussed, where each molecule can be described by a variety of properties. The side information introduced to the learning process will ultimately be the property the experimenter is interested in, and typicality will have to be understood with respect to that property.

JAFFE: Expressions As Weighted Mixtures
One advantage of deep AA compared to the plain Variational Autoencoder (VAE) is a globally interpretable latent structure. All latent means $\mu_i$ will be mapped inside the convex region spanned by the archetypes. And as archetypes represent extremes of the data set which are present to some percentage in all data points, these percentages or weights can be used to explore the latent space in an informed fashion. This might be especially advantageous in the case of higher-dimensional latent spaces. For example, the center of the simplex will always accommodate latent representations of input data that are considered mean samples of the data set. Moreover, directions within the simplex have meaning in the sense that when "walking" towards or away from a given archetype, the characteristics of that archetype will either be reinforced or diminished in the decoded datum associated with the actual latent position. This is shown in the Hinton plot in Figure 6 where mixture 1 is a mean sample, i.e. with equal archetype weights. Starting at this position and moving on a straight line in the direction of archetype 3 increases its influence while equally diminishing the influence of both archetypes 1 and 2. This results in mixture 2, which starts to look surprised, but not as extremely surprised as archetype 3. In the same fashion, mixtures 3 and 4 are the results of walking straight in the direction of archetypes 2 or 1, which results in a sad face (mixture 3) and a slightly happy facial expression (mixture 4).
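Walking on a straight line from one mixture towards an archetype amounts to a convex interpolation between the current weight vector and the vertex of the simplex that belongs to that archetype. The following sketch illustrates this for three archetypes; the function name, the step size and the placement of the fixed archetypes at the standard basis vectors are assumptions for this example, and the decoding step that would turn the latent position into an image is omitted.

```python
import numpy as np

def walk_towards(a_start, j, s):
    """Move a fraction s in [0, 1] from the weight vector a_start towards archetype j."""
    e_j = np.zeros_like(a_start)
    e_j[j] = 1.0                              # weight vector of the pure archetype j
    return (1.0 - s) * a_start + s * e_j      # convex interpolation stays on the simplex

k = 3
Z_fixed = np.eye(k)                           # assumption: archetypes at standard simplex vertices
a_mean = np.full(k, 1.0 / k)                  # mean sample with equal archetype weights

a_mix = walk_towards(a_mean, j=2, s=0.5)      # halfway towards archetype 3 (index 2)
t_mix = a_mix @ Z_fixed                       # latent position that a decoder would receive
```

Halfway along the line the weight of archetype 3 has grown from 1/3 to 2/3 while the weights of archetypes 1 and 2 have both dropped to 1/6, matching the description that the influence of the other two archetypes diminishes equally.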

JAFFE: Deep AA Versus VAE
Deep AA is designed to be a model that simultaneously learns an appropriate representation and identifies meaningful latent archetypes. This model can be compared to a plain VAE where a latent space is learned first and subsequently linear AA is performed on that space in order to approximate the latent convex hull. Figure 7a shows the interpolation in the deep AA model between two images, neither of them archetypes, from "happy" to "sad". Compared to Figure 7b, which shows the same interpolation in a VAE model with subsequently performed linear AA, the interpolation based on deep AA gives a markedly better visual impression. In the case of deep AA, this is explained by the fact that all data points are mapped into the simplex, which ensures a relatively dense distribution of the latent means. On the other hand, the latent space of the VAE model has no hard geometrical restrictions and thus the distribution of the latent representatives will be less dense or even "patchy", i.e. with larger empty areas in latent space. Especially with small data sets such as JAFFE, of which fewer than 200 images are used, interpolation quality might be strongly affected by the unboundedness of the latent space of VAE models.
Fig. 6: Knowing the archetypes allows for an informed exploration of the latent space by not directly sampling latent space coordinates but by specifying a desired mixture with respect to the known archetypes.

The Chemical Universe Of Molecules
In the following section the application of deep AA to the domain of chemistry is explored. Starting with an initial set of chemical compounds, e.g. small organic molecules with cyclic cores (Visini et al., 2017), and iteratively applying a finite number of reactions, will eventually lead to a huge collection of molecules with extreme combinatorial complexity. But while the total number of all possible small organic molecules has been estimated to exceed $10^{60}$ (Kirkpatrick and Ellis, 2004), even this number pales in comparison to the whole chemical universe of organic chemistry. In general, the efficient exploration of chemical spaces requires methods capable of learning meaningful representations and endowing these spaces with a globally interpretable structure. Prominent examples of chemistry data sets include the family of GDB-xx data sets (generic database), e.g. GDB-13 (Blum and Reymond, 2009), which enumerates small organic molecules of up to 13 atoms, composed of the elements C, N, O, S and Cl, following simple chemical stability and synthetic feasibility rules. With more than 970 million structures, GDB-13 is the largest publicly available database of small organic molecules to date.
Fig. 8: Latent structure of the JAFFE data set when trained on a subset of the side information containing only the emotion ratings for "sadness" and "disgust".
Exploring The Chemical Space. As discussed in section 2.2, archetypal analysis lends itself to a distinctly evolutionary interpretation. Although this is certainly a more biological perspective, the basic principle is applicable to other fields. In chemistry, the principle of evolutive abiogenesis describes a process in which simple organic compounds increase in complexity (Miller, 1953). In the following experiment a structured chemical space is learned using as side information the heat capacity $C_v$, which quantifies the amount of energy (in Joule) needed to raise the temperature of 1 mol of molecules by 1 K at constant volume. A high $C_v$ number is important e.g. in applications dealing with the storage of thermal energy (Cabeza et al., 2015). In the following, all experiments are based on the QM9 data set (Ramakrishnan et al., 2014; Ruddigkeit et al., 2012). For visualization, the 3-dimensional molecular representations have been created with Jmol (2019).
Experiment 1: Model Selection. The mean absolute error (MAE) is assessed while varying the number of archetypes. The result is shown in Figure 9. Model selection is performed by observing for which number of archetypes the MAE starts to converge. The knee of this curve is used to select the optimal number of archetypes, which is 20. If the number of archetypes is smaller, it becomes more difficult to reconstruct the data. This is explained by the fact that there exists a large number of molecules with very similar heat capacities but at the same time distinctly different geometric configurations. As a consequence, molecules with different configurations are mapped to archetypes with similar heat capacity, making it hard to resolve the many-to-one mapping in the latent space.
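The knee-based selection described above can be sketched as follows. The text does not state how the knee is located, so this sketch assumes a common heuristic: the knee is the point of the normalized curve farthest from the chord joining its two end points. The function name `knee_index` and the MAE values are hypothetical and serve only as an illustration; they are not the values of Figure 9.

```python
def knee_index(xs, ys):
    """Index of the point farthest from the chord joining the curve's end points."""
    # Normalize both axes to [0, 1] so the result does not depend on units.
    x_lo, x_hi = xs[0], xs[-1]
    y_lo, y_hi = min(ys), max(ys)
    nx = [(x - x_lo) / (x_hi - x_lo) for x in xs]
    ny = [(y - y_lo) / (y_hi - y_lo) for y in ys]
    # Direction of the chord from the first to the last point.
    dx, dy = nx[-1] - nx[0], ny[-1] - ny[0]
    norm = (dx * dx + dy * dy) ** 0.5
    # Perpendicular distance of every point to that chord.
    dists = [abs(dx * (ny[0] - y) - dy * (nx[0] - x)) / norm
             for x, y in zip(nx, ny)]
    return max(range(len(xs)), key=dists.__getitem__)


# Hypothetical test-set MAE values (illustration only, not from the paper).
n_archetypes = [2, 5, 10, 15, 20, 25, 30, 40]
mae = [0.90, 0.75, 0.55, 0.35, 0.18, 0.17, 0.165, 0.16]

best_k = n_archetypes[knee_index(n_archetypes, mae)]
print(best_k)
```

Other knee criteria, such as a threshold on the relative MAE improvement, would be equally plausible and may select a different number of archetypes on a flat curve.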
Experiment 2: Archetypal Molecules. Archetypal molecules are identified along with the heat capacities associated with them. A fixed number of 20 archetypes is used for an optimal exploration-exploitation trade-off, in accordance with the model selection discussed in the previous section. In chemistry, the heat capacity at constant volume is defined as

$$C_v = \left. \frac{d\epsilon}{dT} \right|_{v=\text{const}}$$

where $\epsilon$ denotes the energy of a molecule and $T$ its temperature. This energy can be further decomposed into different parts, such that $\epsilon = \epsilon_{Tr} + \epsilon_{R} + \epsilon_{V} + \epsilon_{E}$. Each part is associated with a different degree of freedom of the system. Here, $Tr$ stands for the translational, $R$ for the rotational, $V$ for the vibrational and $E$ for the electronic contribution to the total energy of the system (Atkins and de Paula, 2010; Tinoco, 2002). With this decomposition in mind, the different archetypal molecules associated with a particular heat capacity are compared in Figure 10. In both panels of that figure, the rows correspond to the three molecules in the QM9 data set (test set) that have been mapped closest to a vertex of the latent simplex and have thus been identified as being extremes with respect to the heat capacity. Out of a total of 20 vertices, molecules in close proximity to four of them are displayed here. Panel 10a shows the configuration of six archetypal molecules. The upper three are all associated with a low heat capacity while the lower three all have a high heat capacity. This result can easily be interpreted, as the lower heat capacity can be traced back to the shorter chain length and the higher number of double bonds of these molecules, which makes them more stable and results in a lower vibrational energy $\epsilon_V$ and subsequently in a lower heat capacity. The inverse is observed for the linear archetypal molecules with higher heat capacities, which show, relative to their size, a lower number of double bonds and a long linear structure. Panel 10b shows both linear (lower row) and non-linear archetypal molecules (upper row) with similar atomic mass. Here, the non-linear molecules containing a cyclic structure in their geometry are more stable and therefore have an overall slightly lower heat capacity compared to their linear counterparts of the same weight, shown in the second row.
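Selecting the test-set molecules that are mapped closest to a vertex of the latent simplex can be sketched as a nearest-neighbour query. This is a minimal sketch under assumptions: Euclidean distance in latent space is assumed, the function name `closest_to_vertices` is hypothetical, and the two-dimensional toy coordinates stand in for the latent means of real molecules.

```python
import math


def closest_to_vertices(latent_means, vertices, n_closest=3):
    """For each simplex vertex, return the indices of the n_closest latent means."""
    nearest = []
    for vertex in vertices:
        # Rank all samples by Euclidean distance to this vertex.
        ranked = sorted(range(len(latent_means)),
                        key=lambda i: math.dist(latent_means[i], vertex))
        nearest.append(ranked[:n_closest])
    return nearest


# Toy example (assumption): a 2-D simplex with three vertices and ten samples.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
latent_means = [
    (0.05, 0.02), (0.90, 0.05), (0.50, 0.90), (0.10, 0.10), (0.50, 0.40),
    (0.95, 0.00), (0.45, 0.95), (0.20, 0.05), (0.80, 0.10), (0.55, 0.80),
]

archetypal = closest_to_vertices(latent_means, vertices)
print(archetypal)
```

Each inner list holds the indices of the three samples nearest to one vertex, ordered from closest to farthest.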
Experiment 3: Interpolation Between Two Archetypal Molecules. Interpolation is performed by plotting the samples from the test set which are closest to the connecting line between the two archetypes. As a result, one can observe a smooth transition from a molecule with a ring structure to a linear chain molecule. Both the starting point and the end point of this interpolation are characterized by a similar heat capacity, such that these archetypes differ only in their geometric configuration but not with respect to their side information. As a consequence, any molecule in close proximity to that connecting line can differ only with respect to its structure, but must display a similarly high heat capacity. Figure 11 shows an example of such an interpolation.
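The selection of samples along the connecting line can be sketched as a point-to-segment projection. The text does not specify the distance measure or the number of samples shown, so Euclidean distance in latent space and a fixed sample count are assumptions here; the function names and the toy coordinates are hypothetical.

```python
import math


def project_onto_segment(p, a, b):
    """Return (t, distance): position along segment a->b and distance of p to it."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    t = sum(u * v for u, v in zip(ab, ap)) / sum(u * u for u in ab)
    t = min(1.0, max(0.0, t))  # clamp to the segment between the archetypes
    foot = [ai + t * u for ai, u in zip(a, ab)]
    return t, math.dist(p, foot)


def interpolation_path(points, a, b, n_samples):
    """Indices of the n_samples points nearest to segment a->b, ordered from a to b."""
    scored = [(i,) + project_onto_segment(p, a, b) for i, p in enumerate(points)]
    nearest = sorted(scored, key=lambda s: s[2])[:n_samples]
    return [i for i, t, d in sorted(nearest, key=lambda s: s[1])]


# Toy example (assumption): two archetypes and six test samples in a 2-D latent space.
archetype_a, archetype_b = (0.0, 0.0), (1.0, 0.0)
samples = [(0.5, 0.02), (0.1, 0.5), (0.9, 0.01),
           (0.2, 0.03), (0.6, 0.7), (0.75, 0.05)]

path = interpolation_path(samples, archetype_a, archetype_b, n_samples=4)
print(path)
```

Ordering the selected samples by their position along the segment yields the sequence that would be plotted from the first archetype to the second.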
Experiment 4: The Role Of Side Information And The Exploration Of Chemical Space. Deep AA structures latent spaces both according to the information contained in the input to the encoder and according to the side information provided. As a consequence, any molecule characterized as a true mixture of two or more archetypes, given a specific side information such as heat capacity, might suddenly be identified as archetypal should the side information change accordingly. In the following, archetypal molecules with respect to heat capacity as the side information are compared to archetypes obtained while providing the band gap energy of each molecule as the side information. In Figure 12a archetypal molecules with both the highest and the lowest heat capacities are displayed, while Figure 12b shows archetypes with the highest and lowest band gap energies. The archetypes differ significantly in their structure as well as in their atomic composition. For example, archetypal molecules with low heat capacity are rather small, with only a few C and O atoms, while archetypal molecules with a low band gap energy are characterized by ring structures containing N and H atoms. This illustrates how essential side information is for defining typicality but also for the subsequent interpretation of the obtained structure of the latent space.

Alternative Priors For Deep Archetypal Analysis
The standard normal distribution is a typical choice for the prior distribution p(t) due to its simplicity and the closed-form solution for the KL divergence. However, employing alternative priors might be beneficial for the structure of the latent space and have an impact on the identified archetypes. Leaving the wide range of well-explored priors for vanilla VAEs aside, we explore a hierarchical prior that directly corresponds to the generative model of linear AA (Eq. 4), i.e. isotropic Gaussian noise around a linear combination of the archetypes. We rely on Monte Carlo sampling for the estimation of the KL divergence in Eq. 7. To compare the different priors qualitatively, we run Deep AA on the Japanese Female Facial Expression (JAFFE) data set with four archetypes. The architecture used is similar to the previous experiments, but we additionally learn the variance of the decoder. The Lagrange parameters or weights in Eq. 14 are set to 1000 for the archetype loss and to 100 for the KL divergence.
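A Monte Carlo estimate of the KL divergence can be sketched in one dimension as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the approximate posterior is a Gaussian, the hierarchical prior is approximated by a finite mixture of Gaussians centred at convex combinations of the archetypes, and the mixture weights are drawn from a flat Dirichlet distribution. All function names, archetype positions and sample sizes are hypothetical. The estimator is first checked against the closed-form KL divergence to a standard normal prior.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def mc_kl(q_mean, q_std, log_prior, n_samples, rng):
    """Monte Carlo estimate of KL(q || p) for q = N(q_mean, q_std^2)."""
    total = 0.0
    for _ in range(n_samples):
        t = rng.gauss(q_mean, q_std)  # draw t from q
        total += log_normal_pdf(t, q_mean, q_std) - log_prior(t)
    return total / n_samples


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]


def make_hierarchical_log_prior(archetypes, noise_std, n_mix, rng):
    """Approximate log p(t) by a mixture of Gaussians around convex mixtures
    of the archetypes (assumption: flat Dirichlet mixture weights)."""
    centres = []
    for _ in range(n_mix):
        w = sample_dirichlet([1.0] * len(archetypes), rng)
        centres.append(sum(wi * ai for wi, ai in zip(w, archetypes)))

    def log_prior(t):
        logs = [log_normal_pdf(t, c, noise_std) for c in centres]
        m = max(logs)  # log-mean-exp for numerical stability
        return m + math.log(sum(math.exp(v - m) for v in logs) / n_mix)

    return log_prior


rng = random.Random(0)
q_mean, q_std = 0.5, 0.8

# Sanity check against the closed form for a standard normal prior.
kl_mc = mc_kl(q_mean, q_std, lambda t: log_normal_pdf(t, 0.0, 1.0), 20000, rng)
kl_exact = 0.5 * (q_std ** 2 + q_mean ** 2 - 1.0 - 2.0 * math.log(q_std))

# Hierarchical prior with two hypothetical 1-D archetypes at -1 and 1.
hier_log_prior = make_hierarchical_log_prior([-1.0, 1.0], 0.5, 200, rng)
kl_hier = mc_kl(q_mean, q_std, hier_log_prior, 2000, rng)
print(kl_mc, kl_exact, kl_hier)
```

With few samples such an estimate fluctuates noticeably from one draw to the next, which is one reason a sampled KL term can be noisier than the closed-form term available for the standard normal prior.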
Figure 13 shows examples of the archetypes found with the standard normal prior and the sampling Dirichlet prior. In general, different priors do not seem to strongly affect the archetypes found. However, the latent spaces do differ, which can be seen in the projection to the first two principal components in Figure 14. As a reference, a uniformly filled simplex would result in a triangle in the projection. The difference is caused by large gaps in the higher-dimensional simplex when using the hierarchical prior, which we assume is mainly due to the high variance of the KL divergence estimate.
In our experience, the choice of the prior is not of primary concern for finding archetypes, as long as it encourages the latent space to be spread out inside the simplex, be that via a standard normal, a uniform or the presented hierarchical prior.

Conclusion
We have presented in this paper an extension of linear Archetypal Analysis, a technique for exploratory data analysis and interpretable machine learning. By performing Archetypal Analysis in the latent space of a Variational Autoencoder, we have demonstrated that the learned representation can be structured in a way that allows it to be characterized by its most extremal or archetypal representatives. As a result, each observation in a data set can be described as a convex mixture of these extremes. Endowed with such a structure, a latent space can be explored by varying the mixture coefficients with respect to the archetypes, instead of exploring the space by uniform sampling. Furthermore, we have demonstrated the need for including side information in the process of learning latent archetypal representations. As extremity is a form of typicality, a definition of what is understood to be "typical" needs to be given first. Providing such a definition is the role of side information and allows learning interpretable archetypes. In contrast to the original archetype model, our method offers three advantages: First, our model learns representations in a data-driven fashion, thereby reducing the need for expert knowledge. Second, our model can learn appropriate transformations to obtain meaningful archetypes, even if non-linear relations between features exist. Third, our model allows the incorporation of side information. The application of this new method is demonstrated on a sentiment analysis task, where emotion archetypes are identified based on female facial expressions, for which multi-rater based emotion scores are available as side information. A second application illustrates the exploration of the chemical space of small organic molecules and demonstrates how crucial side information is for interpreting the geometric configuration of these molecules.

Fig. 2: Phenospace of different species of Microchiroptera. The dominant food habit of each species, and thereby the ability to procure this food source, is linked to the morphology of the animals, e.g. a higher Wing Aspect Ratio corresponds with the greater aerodynamic efficiency needed to chase high flying insects. Archetypes are extreme types, optimized to perform a single task. Proximity of a species to an archetype quantifies the level of adaptation this species has undergone with respect to the optimization objective or task. Reprinted from Shoval et al. (2012) with permission.
(a) Linear AA is unable to recover the true archetypes. (b) Latent space embedding of non-linear artificial data.

Fig. 4: While linear archetypal analysis is in general unable to approximate the convex hull of a non-linear data set well, Deep AA learns an appropriate latent representation where the ground truth archetypes can correctly be identified.
(a) Archetype latent space of the JAFFE data set. (b) Location of emotion adjectives in latent space.

Fig. 5: Deep AA with k = 3 archetypes identifies sadness as a mixture mostly between happiness and anger while disgust lies between the archetypes for anger and surprise.
Fig. 7: Two input images were located in the latent space of the deep AA model and of the VAE model with subsequently performed linear AA. The interpolation is qualitatively better in the case of deep AA, where latent means are mapped more densely together due to the simplex constraints.

Fig. 9: Model selection on the QM9 data set: Mean absolute error (reconstruction loss) vs. number of archetypes on the test set.

Fig. 10: Both panels illustrate a comparison between archetypal molecules, where typicality is understood with respect to the molecular property heat capacity. Each column contains the three molecules of the test set that have been mapped closest to a specific vertex of the latent simplex. Panel (a) compares archetypal linear molecules characterized by a short chain structure versus long chained molecules. Panel (b) compares archetypal molecules with similar masses but different geometric configuration, i.e. with and without a cyclic structure.

Fig. 11: Interpolation between two archetypal molecules produced by deep AA. The labels display the heat capacity of each molecule. Here, only a single example is shown, but similar results can be observed for other combinations of archetypes.

Fig. 12: Panels (a) and (b) compare archetypal molecules identified using different side information: Here, the labels correspond to the heat capacity (panel a) and the band gap energy (panel b). The columns contain the three molecules of the test set closest to the given archetype.
(a) Archetypes learned using the sampling Dirichlet prior. (b) Archetypes learned using the standard normal prior.

Fig. 13: Deep AA with k = 4 archetypes using two different priors, which both identify similar archetypes.
(a) JAFFE latent space with sampling Dirichlet prior. (b) JAFFE latent space with standard normal prior.

Table 1: Inferred specialization of the archetypal species of Microchiroptera indicated in Figure