# Probabilistic archetypal analysis

## Abstract

Archetypal analysis represents a set of observations as convex combinations of pure patterns, or archetypes. The original geometric formulation, which finds archetypes by approximating the convex hull of the observations, assumes the observations to be real-valued. This, unfortunately, is not compatible with many practical situations. In this paper we revisit archetypal analysis from basic principles, and propose a probabilistic framework that accommodates other observation types such as integers, binary values, and probability vectors. We corroborate the proposed methodology with real-world applications on finding archetypal soccer players based on performance data, archetypal winter tourists based on binary survey data, archetypal disaster-affected countries based on disaster count data, and document archetypes based on term-frequency data. We also present a visualization tool to better summarize archetypal analysis solutions.

### Keywords

Archetypal analysis, Probabilistic modeling, Majorization–minimization, Visualization, Convex hull, Binary observation

## 1 Introduction

Archetypal analysis (AA) represents observations as compositions of pure patterns, i.e., *archetypes*, or equivalently as convex combinations of extreme values (Cutler and Breiman 1994). Although AA resembles many well-established prototypical analysis tools, such as principal component analysis (PCA, Mohamed et al. 2009), non-negative matrix factorization (NMF, Févotte and Idier 2011), probabilistic latent semantic analysis (Hofmann 1999), and *k*-means (Steinley 2006), AA is arguably unique, both conceptually and computationally. Conceptually, AA imitates the human tendency of representing a group of objects by its extreme elements (Davis and Love 2010): this makes AA an interesting exploratory tool for applied scientists (e.g., Eugster 2012; Seiler and Wohlrabe 2013). Computationally, AA is *data-driven*, and requires the *factors* to be probability vectors: these properties make AA computationally demanding, yet bring better interpretability.

The concept of AA was originally formulated by Cutler and Breiman (1994). The authors posed AA as the problem of learning the convex hull of a point-cloud, and solved it using an alternating non-negative least squares method. In recent years, different variations and algorithms based on the original geometrical formulation have been presented (Bauckhage and Thurau 2009; Eugster and Leisch 2011; Mørup and Hansen 2012). However, this framework does not cover many interesting situations. For example, consider the problem of finding archetypal responses to a binary questionnaire. This is a potentially useful problem in areas of psychology and marketing research that cannot be addressed in the standard AA formulation, which relies on the observations existing in a vector space in order to form a convex hull. Even when the observations exist in a vector space, standard AA might not be an appropriate tool for analyzing them. For example, in the context of learning archetypal text documents with tf-idf as features, standard AA will be inclined to find archetypes based on the volume rather than the content of the documents.

We therefore propose a probabilistic formulation of AA, where the underlying idea is *to form the convex hull in the parameter space*. The parameter space is often vectorial even if the sample space is not (see Fig. 2 for the plate diagram). We solve the resulting optimization problem using majorization–minimization, and also suggest a visualization tool to help understand the solution of AA better.

The paper is organized as follows. In Sect. 2, we start with an in-depth discussion of what archetypes mean, how this concept has evolved over the last decade, and how it has been utilized in different contexts. In Sect. 3, we provide a probabilistic perspective on this concept, and suggest *probabilistic archetypal analysis*. Here we explicitly tackle the cases of Bernoulli, Poisson and multinomial probability distributions, and derive the necessary update rules (derivations are available in the appendix). In Sect. 4, we discuss the connection between AA and other prototypical analysis tools, a connection that has also been partly noted by other researchers (Mørup and Hansen 2012). In Sect. 5, we provide simulations to show the difference between probabilistic and standard archetypal analysis solutions. In Sect. 6, we discuss a visualization method for archetypal analysis, and present several improvements. In Sect. 7, we present an application for each of the above observation models: finding archetypal soccer players based on performance data; finding archetypal winter tourists based on binary survey data; finding archetypal disaster-affected countries based on disaster count data; and finding document archetypes based on term-frequency data. In Sect. 8, we summarize our contribution, and suggest future directions.

Probability distributions used in the paper

Distribution | Notation | Parameters | Pdf/pmf |
---|---|---|---|
Normal | \(\mathcal {N}({\varvec{\mu }}, {\varvec{\varSigma }})\) | \({\varvec{\mu }} \in \mathbb {R}^K\), \({\varvec{\varSigma }} \in \mathbb {R}^{K \times K}\) | \((2\pi )^{-\frac{K}{2}} \vert {\varvec{\varSigma }}\vert ^{-\frac{1}{2}} \exp \{-\frac{1}{2}({\varvec{x}} - {\varvec{\mu }})'{\varvec{\varSigma }}^{-1} ({\varvec{x}} - {\varvec{\mu }})\}\) |
Dirichlet | \(\mathrm{Dir}({\varvec{\alpha }})\) | \({\varvec{\alpha }} = (\alpha _1, \ldots , \alpha _K)\), \(K > 1\), \(\alpha _i > 0\) | \(\frac{1}{\mathrm{B}({\varvec{\alpha }})} \prod ^K_{i = 1} x^{\alpha _i - 1}_i\) where \(\mathrm{B}({\varvec{\alpha }}) = \frac{\prod ^K_{i = 1} \varGamma (\alpha _i)}{\varGamma (\sum ^K_{i=1} \alpha _i)}\) |
Poisson | \(\mathrm{Pois}(\lambda )\) | \(\lambda > 0\) | \(\frac{\lambda ^x}{x!} \exp \{-\lambda \}\) |
Bernoulli | \(\mathrm{Ber}(p)\) | \(0 < p < 1\) | \(p^x (1 - p)^{1-x}\) |
Multinomial | \(\mathrm {Mult}(n, {\varvec{p}})\) | \(n > 0\), \({\varvec{p}} = (p_1, \ldots , p_K)\), \(\sum ^K_{i=1} p_i = 1\) | \(\frac{n!}{x_1! \cdots x_K!}p_1^{x_1} \cdots p_K^{x_K}\) |

## 2 Review

The goal of archetypal analysis is to find archetypes, ‘pure’ patterns. In Sect. 2.1 we provide some intuition on what these pure patterns imply. In Sect. 2.2, we discuss the mathematical formulation of this archetypal analysis as suggested by Cutler and Breiman (1994). In Sect. 2.3 we discuss how this concept has been utilized since its inception: here, we point out key references, important developments, and convincing applications.

### 2.1 Intuition

Archetypes are ‘ideal examples of a type’. The word ‘ideal’ does not necessarily have the qualitative meaning of being ‘good’; rather, the concept is mostly subjective. For example, one can consider an ideal example to be a prototype, and other objects to be variations of this prototype. This view is close to the concept of clustering, where the centers of the clusters are the prototypes. For archetypal analysis, however, ideal example has a different meaning. Intuitively, it implies that the prototype *cannot be surpassed*: it is the purest or the most extreme that can be witnessed. A simple example of archetypes are the colors red, blue and green (cf. Fig. 1a) in the RGB color space: any other color can be expressed as a combination of these ideal colors. Another example is comic book superheroes who excel in some unique characteristic, say speed or stamina or intelligence, more than anybody else with these abilities: they are the archetypal superheroes with that particular ability. The non-archetypal superheroes, on the other hand, possess “many” abilities that are not extreme. It is to be noted that a person with all the abilities realized to their fullest, if such a person exists, is an archetype. Similarly, if one considers normal humans alongside super-humans, then a person with none of these abilities is also an archetype.

In both these examples, the archetypes are rather trivial. If one represents each color in the RGB space then it is obvious that the unit vectors R, G and B are pure colors or archetypes. Similarly, if one represents every (super-)human on a two-dimensional normalized scale of strength and intelligence, then there are four extreme instances, and hence four archetypes: first and second, the persons with the highest score in one of these attributes and none in the other; third and fourth, the persons with the highest and the lowest score in both attributes. However, in reality one may not observe these attributes directly, but only some other features. For example, one can describe a person with many personality traits, such as humor, discipline, optimism, etc., but these characteristics cannot be measured directly. However, one can prepare a questionnaire (or observed variables) that explores these (latent) personality traits. From this questionnaire, curious users can attempt to identify archetypal humans, say an archetypal leader or an archetypal jester.

Finding archetypal patterns in the observed space is a non-trivial problem. It is difficult, in particular, since the archetype itself may not belong to the set of observed samples but *should be inferred*; yet, it should also not be “mythological”, but rather something that *might be observed*. Cutler and Breiman (1994) suggested a simple yet elegant solution that finds the approximate convex hull of the observations, and defines its vertices as archetypes. This allows individual observations to be best represented by compositions (convex combinations) of archetypes, while archetypes can only be expressed by themselves, i.e., they are the ‘purest’ or ‘most extreme’ forms. This is certainly not the most desirable solution, since the inferred archetypes are restricted to lie on the boundary of the convex hull of the observations, whereas the true archetypes may lie outside; inferring archetypes outside the observation hull, however, would require strong regularity assumptions. The solution suggested by Cutler and Breiman (1994) thus strikes a trade-off between computational simplicity and the intuitive nature of archetypes.

### 2.2 Formulation

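Cutler and Breiman (1994) posed AA as the problem of finding column stochastic matrices \(\mathbf{W }\) and \(\mathbf{H }\) that minimize the reconstruction error

\[\min _{\mathbf{W },\mathbf{H }} \Vert \mathbf{X } - \mathbf{X }\mathbf{W }\mathbf{H }\Vert _F^2, \qquad (1)\]

where \(\mathbf{X }\) is the \(M \times N\) dimensional matrix of *N* observations, and, given *K* archetypes, \(\mathbf{W }\) is an \(N\times K\) dimensional matrix, and \(\mathbf{H }\) is a \(K \times N\) dimensional matrix. Here, \(\mathbf{Z }=\mathbf{X }\mathbf{W }\) are the inferred archetypes, which lie on the convex hull of the observations due to the stochasticity of \(\mathbf{W }\), and for each *n*-th sample \(\mathbf{x }_n\), \(\mathbf{Z }\mathbf{h }_n\) is its projection on the convex hull of the archetypes.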
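As a minimal sketch of this formulation, the following Python snippet evaluates the least-squares reconstruction error \(\Vert \mathbf{X } - \mathbf{X }\mathbf{W }\mathbf{H }\Vert _F^2\) used in standard AA for column stochastic \(\mathbf{W }\) and \(\mathbf{H }\). The helper names `random_column_stochastic` and `aa_objective`, the random data, and the dimensions are illustrative assumptions; the snippet only evaluates the objective and does not optimize it.

```python
import numpy as np

def random_column_stochastic(rows, cols, rng):
    """Return a rows x cols matrix whose columns are non-negative and sum to one."""
    return rng.dirichlet(np.ones(rows), size=cols).T

def aa_objective(X, W, H):
    """Return the squared Frobenius error ||X - X W H||_F^2 and the archetypes Z = X W."""
    Z = X @ W                      # each archetype is a convex combination of observations
    return float(np.sum((X - Z @ H) ** 2)), Z

rng = np.random.default_rng(0)
M, N, K = 2, 50, 3                 # features, observations, archetypes (hypothetical sizes)
X = rng.normal(size=(M, N))
W = random_column_stochastic(N, K, rng)   # N x K
H = random_column_stochastic(K, N, rng)   # K x N
loss, Z = aa_objective(X, W, H)
```

Because every column of `W` is a probability vector, each column of `Z` stays inside the convex hull of the observations, and each `Z @ H[:, n]` stays inside the convex hull of the archetypes.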

### 2.3 Development

The first publication, to the best of our knowledge, which deals with the idea of “ideal types” and observations related to them, is Woodbury and Clive (1974). There, the authors discuss how to derive estimates of grades of membership of categorical observations, given an a-priori defined set of ideal (or pure) types, in the context of clinical judgment. Twenty years later, Cutler and Breiman (1994) formulated archetypal analysis (AA) as the problem of estimating both the membership and the ideal types given a set of real-valued observations. They motivated this new kind of analysis with, among other examples, the estimation of archetypal head dimensions of Swiss Army soldiers.

One of the original authors continued her work on AA in the field of physics and applied it to spatio-temporal data (Stone and Cutler 1996). In this line of research, Cutler and Stone (1997) developed *moving archetypes*, by extending the original AA framework with an additional optimization step, which estimates the optimal shift of observations in the spatial domain over time. They applied this method to data gathered from a chemical pulse experiment. Other researchers took up the idea of AA and applied it in different fields; the following is a comprehensive list of problems where other researchers have applied AA: analysis of galaxy spectra (Chan et al. 2003), ethical issues and market segmentation (Li et al. 2003), thermogram sequences (Marinetti et al. 2007), gene expression data (Thøgersen et al. 2013); performance analysis in marketing (Porzio et al. 2008), sports (Eugster 2012), and science (Seiler and Wohlrabe 2013); face recognition (Xiong et al. 2013); and game AI development (Sifa and Bauckhage 2013).

In recent years, animated by the rise of non-negative matrix factorization research, various authors have proposed extensions and variations of the original algorithm. The following are a few notable publications. Thurau et al. (2009) introduce the convex-hull non-negative matrix factorization (NMF). Motivated by the convex NMF, the authors make the same assumption as made in AA that observations are convex combinations of specific observations. However, they derive an algorithm which estimates the archetypes not from the entire set of observations but from potential candidates found from 2-dimensional projections on eigenvectors: this leads to a solution that is also applicable to large data sets. The authors demonstrate their method on a data set consisting of 150 million votes on World of \(\hbox {Warcraft}^{\textregistered }\) guilds. In Thurau et al. (2010), the authors present an even faster approach by deriving a highly efficient volume maximization algorithm. Eugster and Leisch (2011) tackle the problem of robustness, namely that a single outlier can break down the archetype solution. They adapt the original algorithm to be a robust M-estimator and present an iteratively reweighted least squares fitting algorithm. They evaluate their algorithm using the Ozone data from the original AA paper with contaminated observations. Mørup and Hansen (2012) also tackle the problem of deriving an algorithm for large scale AA. They propose a solution based on a simple projected gradient method, in combination with an efficient initialization method for finding candidates of archetypes. The authors demonstrate their method, among other examples, with an analysis of the NIPS bag of words corpus and the Movielens movie rating data set.

## 3 Probabilistic archetypal analysis

*simplex latent variable model*, and normal observation model, i.e.,

*known* loadings \({\varvec{\varTheta }} \in \mathbb {R}^{M\times N}\), i.e.,

The equivalence of this approach to the standard formulation requires that \({\varvec{\varTheta }} = \mathbf{X }\). Although unusual in a probabilistic framework, this contributes to the data-driven nature of AA. In the probabilistic framework, \({\varvec{\varTheta }}\) can be viewed as a set of known bases that is defined by the observations, and the purpose of archetypal analysis is to find a *sparse* set of bases that can explain the observations. These inferred bases are the archetypes, and the stochasticity constraints on \(\mathbf{W }\) and \(\mathbf{H }\) ensure that they are the extreme values as one desires. It should be noted that \({\varvec{\varTheta }}_n\) does not need to correspond to \(\mathbf{X }_n\): more generally, \({\varvec{\varTheta }} = \mathbf{X }\mathbf{P }\) where \(\mathbf{P }\) is a permutation matrix.

### 3.1 Exponential family

*s*. Notice that *we employ the normal parameter* \({\varvec{\theta }}\) *rather than the natural parameter* \({\varvec{\eta }}({\varvec{\theta }})\), since the former is more interpretable. In fact, the convex combination of \({\varvec{\theta }}\) is more interpretable than the convex combination of \({\varvec{\eta }}({\varvec{\theta }})\), as a linear combination on \({\varvec{\eta }}({\varvec{\theta }})\) would lead to a nonlinear combination of \({\varvec{\theta }}\). To adhere to the original formulation, we suggest \({\varvec{\varTheta }}_{\cdot n}\) to be the *maximum likelihood point estimate* from observation \(\mathbf{X }_{\cdot n}\). Again, the columns of \({\varvec{\varTheta }}\) and \(\mathbf{X }\) do not necessarily have to correspond. Then, we find archetypes \(\mathbf{Z }={\varvec{\varTheta }}\mathbf{W }\) by solving

*probabilistic archetypal analysis* (PAA).

The meaning of *archetype* in PAA is different than in the standard AA since the former lies in the parameter space, whereas the latter in the observation space. To differentiate these two aspects, we call the archetypes \(\mathbf{Z }={\varvec{\varTheta }}\mathbf{W }\) found by PAA [solving (9)], *archetypal profiles*: our motivation is that \({\varvec{\varTheta }}_{\cdot n}\) can be seen as the parametric *profile* that best describes the single observation \(\mathbf{x }_n\), and thus, \(\mathbf{Z }\) are the archetypal profiles that are inferred from them. We generally refer to the set of indices that contribute to the *k*-th archetypal profile, i.e., \(\{i : \mathbf{W }_{ik} > \delta \}\), where \(\delta \) is a small value, as *generating observations* of that archetype. Notice that, when the observation model is multivariate normal with identity covariance, then this formulation is the same as solving (1). We explore some other examples of \(\mathrm{EF}\): multinomial, product of univariate Poisson distributions, and product of Bernoulli distributions.
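To make the role of \({\varvec{\varTheta }}\), \(\mathbf{W }\) and \(\mathbf{H }\) concrete, the sketch below evaluates a PAA-style negative log-likelihood for the Poisson case and extracts the generating observations \(\{i : \mathbf{W }_{ik} > \delta \}\). It assumes a product of independent Poisson distributions, uses the fact that the maximum likelihood estimate of a Poisson rate from a single count is the count itself, and drops the constant \(\sum \log x!\). The function names, the toy matrices and the value of `delta` are hypothetical, and the majorization–minimization updates of the paper are not reproduced here.

```python
import numpy as np

def poisson_paa_nll(X, W, H, eps=1e-12):
    """Negative Poisson log-likelihood of X under rates Theta @ W @ H.

    The additive term sum(log(x!)) is dropped because it does not depend on W or H.
    """
    Theta = X.astype(float)   # ML estimate of a Poisson rate from a single count is the count
    Lam = Theta @ W @ H       # reconstructed rates; Z = Theta @ W are the archetypal profiles
    return float(np.sum(Lam - X * np.log(Lam + eps)))

def generating_observations(W, delta=1e-3):
    """For each archetypal profile k, return the indices i with W[i, k] > delta."""
    return [np.flatnonzero(W[:, k] > delta) for k in range(W.shape[1])]

# Hypothetical toy data: 3 count features, 4 observations, 2 archetypal profiles.
X = np.array([[5, 0, 2, 3],
              [0, 4, 2, 1],
              [1, 1, 1, 1]])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])          # each profile is generated by a single observation
H = np.array([[1.0, 0.0, 0.5, 0.7],
              [0.0, 1.0, 0.5, 0.3]])
nll = poisson_paa_nll(X, W, H)
generators = generating_observations(W)
```

In this toy example the first two observations are the generating observations of the two archetypal profiles, and the remaining observations are explained as convex combinations of those profiles in the parameter space.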

### 3.2 Poisson observations

### 3.3 Multinomial observations

In many practical problems such as document analysis, the observations can be thought of as originating from a multinomial model. In such cases, PAA expresses the underlying multinomial probability as \(\mathbf{P }\mathbf{W }\mathbf{H }\) where \(\mathbf{P }\) is the maximum likelihood estimate achieved from word frequency matrix \(\mathbf{X }\). This decomposition is very similar to PLSA: PLSA estimates a topic by document matrix \(\mathbf{H }\) and a word by topic matrix \(\mathbf{Z }\), while AA estimates a document by topic matrix (\(\mathbf{W }\)) and a topic by document matrix (\(\mathbf{H }\)) from which the topics can be estimated as archetypes \(\mathbf{Z }=\mathbf{P }\mathbf{W }\). Therefore, the archetypal profiles are effectively topics, but topics might not always be archetypes. For instance, given three documents {A,B}, {B,C}, {C,A}; the three topics could be {A}, {B}, and {C}, whereas the archetypes can only be the documents themselves. Thus, it can be argued that archetypes are topics with better interpretability.
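The following sketch illustrates the decomposition \(\mathbf{Z }=\mathbf{P }\mathbf{W }\) for term-frequency data. It assumes that \(\mathbf{P }\) is obtained by normalizing each document's word counts to sum to one (the multinomial maximum likelihood estimate); the count matrix and the weights in `W` are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical word-by-document count matrix (3 words, 3 documents).
# Document 1 is document 0 repeated ten times: same content, larger volume.
X = np.array([[4.0, 40.0, 1.0],
              [4.0, 40.0, 3.0],
              [2.0, 20.0, 6.0]])

P = X / X.sum(axis=0, keepdims=True)   # ML estimate of each document's word distribution

# Column stochastic document-by-topic weights (illustrative values).
W = np.array([[1.0, 0.0],
              [0.0, 0.5],
              [0.0, 0.5]])
Z = P @ W                              # archetypal profiles: word distributions per topic
```

Here the first two columns of `P` coincide although the raw counts differ by a factor of ten, so the document length no longer influences the profiles, and every column of `Z` is again a probability vector over words.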

### 3.4 Bernoulli observations

## 4 Related work

*Principal component analysis and extensions:* Principal component analysis (PCA) finds an orthogonal transformation of a point-cloud, which projects the observations in a new coordinate system that preserves the variance of the point-cloud the best. The concept of PCA has been extended to a probabilistic as well as a Bayesian framework (Mohamed et al. 2009). Probabilistic PCA (PPCA) assumes that the data originates from a lower dimensional subspace on which it follows a normal distribution (\(\mathcal {N}\)), i.e.,

*normally distributed*: an assumption that is often violated in practice, and to tackle such situations one extends PPCA to exponential family (\(\mathrm{EF}\)). The underlying principle here is to change the observation model accordingly:

*s*(*z*) is the sufficient statistic, and \(\theta \) is the *natural parameter*: PAA utilizes a similar approach but with normal parameters.

Similarly, one can also manipulate the latent distribution. A popular choice is the Dirichlet distribution (\(\mathrm {Dir}\)), which has been widely explored in the literature, e.g., in probabilistic latent semantic analysis (PLSA (Hofmann 1999)), \(\mathbf{h }_n \sim \mathrm {Dir}(\mathbf{1 }), \,\mathbf{x }_n \sim \mathrm {Mult}(\mathbf{Z }\mathbf{h }_n), \text{ where } \mathbf{1 }\mathbf{Z } = \mathbf{1 }\); vertex component analysis (do Nascimento and Dias 2005), \(\mathbf{h }_n \sim \mathrm {Dir}(\mathbf{1 }), \,\mathbf{x }_n \sim \mathcal {N}(\mathbf{Z }\mathbf{h }_n,\epsilon \mathbf{I })\); and simplex factor analysis (Bhattacharya and Dunson 2012), a generalization of PLSA (or more specifically of latent Dirichlet allocation, LDA (Blei et al. 2003)). PAA additionally decomposes the loading matrix into simplex factors with known loadings.

*Nonnegative matrix factorization and extensions:* Non-negative matrix factorization (NMF) decomposes a non-negative matrix \(\mathbf{X } \in \mathbb {R}_+^{M\times N}\) into two non-negative matrices \(\mathbf{Z } \in \mathbb {R}_+^{M\times K}\) and \(\mathbf{H } \in \mathbb {R}_+^{K\times N}\) such that \(\mathbf{X } \approx \mathbf{Z }\mathbf{H }\). Lee and Seung (1999) applied the celebrated *multiplicative update rule* to solve this problem, and proved that such update rules lead to a monotonic decrease in the cost function using the concept of *majorization–minimization* (Lee and Seung 2000). Non-negative matrix factorization has been extended to convex non-negative matrix factorization (C-NMF (Ding et al. 2010)) where \(\mathbf{X }\) is not restricted to be non-negative, and \(\mathbf{Z }\) is expressed in terms of \(\mathbf{X }\) itself as \(\mathbf{Z } = \mathbf{X }\mathbf{W }\), where \(\mathbf{W }\) is again a non-negative matrix. The motivation for this modification emerges from its similarity to clustering, and C-NMF has been solved using multiplicative update rules as well.
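For illustration, a common form of the multiplicative update rule for plain NMF under the squared Frobenius cost can be sketched as follows. This is a generic sketch rather than the C-NMF or PAA updates; the function name, the initialization, the iteration count, and the small constant `eps` (added to avoid division by zero) are assumptions.

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=200, eps=1e-12, seed=0):
    """Factorize non-negative X (M x N) as Z @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    Z = rng.random((M, K)) + 0.1    # strictly positive initialization
    H = rng.random((K, N)) + 0.1
    costs = []
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, so non-negativity is preserved.
        H *= (Z.T @ X) / (Z.T @ Z @ H + eps)
        Z *= (X @ H.T) / (Z @ H @ H.T + eps)
        costs.append(float(np.sum((X - Z @ H) ** 2)))
    return Z, H, costs

rng = np.random.default_rng(1)
X_demo = rng.random((6, 8))         # hypothetical non-negative data
Z_demo, H_demo, costs = nmf_multiplicative(X_demo, K=3)
```

The recorded `costs` give a quick empirical check of the monotonic decrease that the majorization–minimization argument guarantees for these updates.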

To simulate the exact clustering scenario, however, \(\mathbf{H }\) is required to be binary (hard clustering) or at least column stochastic (fuzzy clustering). This leads to a more difficult optimization problem, and is usually solved by proxy constraint \(\mathbf{H }^\top \mathbf{H } = \mathbf{I }\) (Ding et al. 2006). Several other alternatives have also been proposed for tackling the stochasticity constraints, e.g., by enforcing it after each iteration (Mørup and Hansen 2012), or by employing a gradient-dependent Lagrangian multiplier (Yang and Oja 2012). However, both these approaches are prone to finding local minima.

## 5 Simulation

In this section, we provide some simple examples showing the difference between probabilistic and standard archetypal analysis solutions. Since we generate data following the true probabilistic model, it is expected that the solution provided by PAA would be more appropriate compared to the standard AA solution. Therefore, the purpose of this section is to perform a sanity check, and to provide insight. Notice that generating observations with known archetypes is not straightforward, since \({\varvec{\varTheta }}\) depends on \(\mathbf{X }\).

*Binary observations:* We generate \(K=6\) binary archetypes in \(d=10\) dimensions by sampling \(\eta _{ik} \sim \mathrm{Bernoulli}(p_s)\), where \({\varvec{\eta }}_{k}\) is an archetype, and \(p_s=0.3\) is the probability of success. Given the archetypes, we generate \(n=100\) observations as \(\mathbf{x }_i \sim \mathrm{Bernoulli}(\mathbf{E }\mathbf{h }_i)\), where \(\mathbf{E }=[{\varvec{\eta }}_1,\ldots ,{\varvec{\eta }}_K]\), and each \(\mathbf{h }_i\) is a stochastic vector sampled from \(\mathrm {Dir}(\alpha )\). To ensure that the \({\varvec{\eta }}_k\) are archetypes, we maintain more observations around them by choosing \(\alpha =0.4\). We find archetypal profiles using both PAA and standard AA, and then binarize them so that they can be matched to the original archetypes using minimum Jaccard distance. We report the results in Fig. 4. We observe that the archetypal profiles found by PAA are more accurate, with 15% more archetypal profiles matching uniquely to the true archetypes than with standard AA.
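The data-generating step of the binary simulation can be sketched as below. The sketch assumes a symmetric Dirichlet distribution with concentration \(\alpha \) in every component, clips the Bernoulli probabilities to \([0, 1]\) purely as a guard against rounding, and includes a hypothetical helper `jaccard_distance` for the matching step; the random seed is arbitrary, and fitting PAA or standard AA is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 6, 10, 100               # archetypes, dimensions, observations
p_s, alpha = 0.3, 0.4

E = rng.binomial(1, p_s, size=(d, K))               # d x K binary archetypes
Hs = rng.dirichlet(alpha * np.ones(K), size=n).T     # K x n stochastic coefficient vectors
probs = np.clip(E @ Hs, 0.0, 1.0)                    # Bernoulli parameters of each observation
X = rng.binomial(1, probs)                           # d x n binary observations

def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors (0 if both are all-zero)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union
```

A small \(\alpha \) concentrates the mass of each coefficient vector on few archetypes, which is why most simulated observations lie close to one of the columns of `E`.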

*Poisson observations:* We generate \(K=6\) count archetypes in \(d=12\) dimensions with one minimal archetype (\(\eta _{ik}=0\)), one maximal archetype \(\eta _{ik} \sim \mathrm{Unif} \{1,\ldots ,10\}\), and the rest of the archetypes with two nonzero entries \(\eta _{ik}\ \sim \mathrm{Unif}\{1,\ldots ,10\}\). Given the archetypes, we generate \(n=500\) observations as \(\mathbf{x }_i \sim \mathrm{Poisson}(\mathbf{E }\mathbf{h }_i)\), where \(\mathbf{E }=[{\varvec{\eta }}_1,\ldots ,{\varvec{\eta }}_K]\), and each \(\mathbf{h }_i\) is a stochastic vector sampled from \(\mathrm {Dir}(\alpha )\). To ensure that the \({\varvec{\eta }}_k\) are archetypes, we maintain more observations around them by choosing \(\alpha =0.4\). We find archetypal profiles using both PAA and standard AA, and match them to the original archetypes using minimum \(l_1\) distance. We report the results in Fig. 5. We observe that the archetypal profiles found by PAA are more accurate, with 9% more archetypal profiles matching uniquely to the true archetypes than with standard AA.

*Term-frequency observations:* We generate \(K=5\) archetypes on the \(d=3\) dimensional probability simplex by choosing *K* equidistant points \(\mathbf{p }_k\) on a circle in the simplex. Given the archetypes, we generate \(n=500\) observations as \(\mathbf{x }_i \sim \mathrm{Mult}(n_i,\mathbf{P }\mathbf{h }_i)\), where \(\mathbf{P }=[\mathbf{p }_1,\ldots ,\mathbf{p }_K]\), and each \(\mathbf{h }_i\) is a stochastic vector sampled from \(\mathrm {Dir}(\alpha )\). To ensure that the \(\mathbf{p }_k\) are archetypes, we maintain more observations around them by choosing \(\alpha =0.5\). We deliberately choose an arbitrary number of occurrences \(n_i \sim \mathrm{Uniform} [1000,2000]\) for each observation: this disrupts the true convex hull structure in the term-frequency observations. We present 10 random runs on these observations for both PAA and standard AA in Fig. 6. We observe that PAA finds the effective archetypes, with occasional local minima. However, standard AA performs poorly since it finds the appropriate archetypes in the term-frequency space, which are different when projected back on the simplex.
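A possible way to generate such term-frequency observations is sketched below. The circle is assumed to be centered at the barycenter of the simplex with radius 0.25 (any radius below \(1/\sqrt{6}\) keeps it inside the simplex); the text does not specify these values, so they, the symmetric Dirichlet, the integer-valued \(n_i\), and the random seed are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, alpha = 5, 500, 0.5
radius = 0.25                                   # assumed; must stay below 1/sqrt(6)

center = np.full(3, 1.0 / 3.0)                  # barycenter of the probability simplex
u = np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0)   # orthonormal directions with zero sum,
v = np.array([1.0, 1.0, -2.0]) / np.sqrt(6.0)   # i.e. a basis of the plane of the simplex
angles = 2 * np.pi * np.arange(K) / K
P = center[:, None] + radius * (np.outer(u, np.cos(angles)) + np.outer(v, np.sin(angles)))  # 3 x K

Hs = rng.dirichlet(alpha * np.ones(K), size=n).T            # K x n coefficient vectors
probs = P @ Hs
probs /= probs.sum(axis=0, keepdims=True)                   # guard against rounding error
counts = rng.integers(1000, 2001, size=n)                   # n_i drawn from {1000, ..., 2000}
X = np.stack([rng.multinomial(counts[i], probs[:, i]) for i in range(n)], axis=1)  # 3 x n

P_hat = X / X.sum(axis=0, keepdims=True)                    # ML estimates, back on the simplex
```

The raw counts `X` differ in scale from column to column because of the varying `counts`, whereas the normalized estimates `P_hat` lie on the simplex regardless of document length.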

## 6 Simplex visualizations

The column stochasticity of \(\mathbf{H }\) allows a principled visualization scheme of archetypal analysis solution, referred to as *simplex visualization*. We discuss certain aspects and enhancements of this approach, and show how it can be utilized to better understand the inferred archetypes.

*K* archetypes \(\mathbf{Z }\) as the corners, and \(\mathbf{h }_n\) as the coordinate with respect to these corners (see Fig. 1b for an illustration). A standard simplex can be projected to two dimensions via a skew orthogonal projection, where all the vertices of the simplex are shown on a circle connected by edges. The individual factors \(\mathbf{h }_n\) can then be projected into this circle. Figure 8 illustrates this principle with the simple data set already used in Fig. 1: (a, b) for the three archetypes solution; (g, h) for the four archetypes solution; (j, k) for the five archetypes solution. Color coding is used in the six figures to show the relation between the original observation \(\mathbf{x }_n\) and its projection \(\mathbf{h }_n\). The visualization with three archetypes is known as a ternary plot (see, e.g., Friendly 2000), and has been used by Cutler and Breiman (1994). The extension to more than three archetypes has also been used (e.g., Bauckhage and Thurau 2009; Eugster and Leisch 2013). However, a formal study of this visualization scheme, to the best of our knowledge, remains to be carried out. In the following, we present three enhancements of this basic visualization to either highlight certain characteristics of an archetypal analysis solution or to overcome consequences of the many-to-one nature of the projection.
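The basic projection can be sketched in a few lines: the *K* vertices are placed equidistantly on the unit circle, and each \(\mathbf{h }_n\) is mapped to the corresponding convex combination of the vertex positions. The function name `simplex_projection` and the choice of the unit circle are assumptions; plotting is omitted.

```python
import numpy as np

def simplex_projection(H):
    """Map column stochastic H (K x N) to 2-D points inside a regular K-gon.

    Returns the 2 x N point coordinates and the 2 x K vertex coordinates.
    """
    K = H.shape[0]
    angles = 2 * np.pi * np.arange(K) / K
    V = np.stack([np.cos(angles), np.sin(angles)])   # vertices on the unit circle
    return V @ H, V

# Two different compositions over K = 4 archetypes (illustrative values).
H_demo = np.array([[0.5, 0.0],
                   [0.0, 0.5],
                   [0.5, 0.0],
                   [0.0, 0.5]])
Y_demo, V_demo = simplex_projection(H_demo)
```

Both columns of `H_demo` land at the center of the circle although they are composed of disjoint pairs of archetypes, which illustrates why the projection is not unique for more than three archetypes.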

In AA, observations which lie outside the approximated convex hull are projected onto its boundary. Figure 8a, b show a simple scenario where these observations are projected onto the corresponding edges. Figure 8d, e show an extreme case of this characteristic: the observations lie on a three-dimensional sphere, and therefore the computed archetypes span a space, which is completely empty. In the corresponding simplex visualization, however, this aspect of the solution is not visible at all. We propose to visualize the ‘deviance’ \(D(\mathbf{x }_n) = 2(\log {p(\mathbf{x }_n|\theta _n)} - \log {p(\mathbf{x }_n|\mathbf{Z }\mathbf{h }_n)})\) where \(\theta _n\) is the maximum likelihood estimate of \(\mathbf{x }_n\), as colors of the points. In case of normal observation model the deviance reduces to the residual sum of squares. Figure 8c shows the corresponding simplex visualization with deviance for the three archetypes solution. The color scheme is from blue to white, with blue implying zero deviance and the lighter the color, the higher the deviance. We can now identify how well the original observations have been approximated by the archetypes, and if they are inside the convex hull. This extension is even more insightful in case of the sphere example. In the corresponding simplex visualization in Fig. 8f, we can now clearly see that almost all observations are outside the approximated convex hull; only around the corners the deviance is near to zero.
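As a sketch of how the deviance used for coloring could be computed, the snippet below implements \(D(\mathbf{x }_n)\) for two observation models: a normal model with identity covariance, where it reduces to the residual sum of squares, and a product of Poisson distributions, where \(\theta _n = \mathbf{x }_n\). The function names, the convention \(0 \log 0 = 0\), the constant `eps`, and the toy matrices are assumptions.

```python
import numpy as np

def normal_deviance(X, Z, H):
    """Deviance per observation under a normal model with identity covariance."""
    return np.sum((X - Z @ H) ** 2, axis=0)          # residual sum of squares

def poisson_deviance(X, Z, H, eps=1e-12):
    """Deviance per observation under independent Poisson models with rates Z @ H."""
    Lam = Z @ H
    ratio = np.where(X > 0, X / (Lam + eps), 1.0)    # convention: 0 * log(0) = 0
    return 2.0 * np.sum(X * np.log(ratio) - (X - Lam), axis=0)

# Illustrative values: two archetypes in two dimensions, two observations.
Z_demo = np.array([[1.0, 4.0],
                   [2.0, 1.0]])
H_demo = np.array([[0.5, 1.0],
                   [0.5, 0.0]])
X_inside = Z_demo @ H_demo          # perfectly represented observations
X_outside = X_inside + 1.0          # observations shifted away from the archetype hull
dev_inside = normal_deviance(X_inside, Z_demo, H_demo)
dev_outside = normal_deviance(X_outside, Z_demo, H_demo)
```

Observations that are exactly representable by the archetypes receive zero deviance, while the shifted observations receive a positive value and would therefore appear in a lighter color.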

The basic simplex visualization arranges the vertices, which represent the archetypes, equidistantly on the circle. The archetypes in the original space, however, are usually not equidistant from each other. Figure 8g and h illustrate this discrepancy: archetype A2, for example, is much nearer to A3 than to A4; in the simplex visualization, however, both are at the same distance. We propose to order the vertices on the circle according to their distances in the original space. This means that we first have to determine an optimal order of the vertices, and then divide the \(360^\circ \) of the circle in relation to the original pairwise distances of the determined neighbor vertices. Here, we solve a Traveling Salesman Problem to get an optimal cyclic order (solved by using, for example, Hahsler and Hornik 2007), and then simply divide the circle in proportion to the original distances. Figure 8i shows the result: it is now clearly visible that A2 and A3 are much nearer to each other than A3 and A4.
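The ordering step can be sketched as follows. For simplicity, the sketch replaces a dedicated Traveling Salesman solver by a brute-force search over all cyclic orders, which is only feasible for a small number of archetypes; the function name, the Euclidean distance, and the toy archetypes are assumptions.

```python
import numpy as np
from itertools import permutations

def cyclic_order_and_angles(Z):
    """Return a shortest cyclic order of the columns of Z and their angles on the circle."""
    K = Z.shape[1]
    D = np.linalg.norm(Z[:, :, None] - Z[:, None, :], axis=0)   # K x K pairwise distances
    best_order, best_len = None, np.inf
    for perm in permutations(range(1, K)):       # fix vertex 0 to remove rotations
        order = (0,) + perm
        length = sum(D[order[i], order[(i + 1) % K]] for i in range(K))
        if length < best_len:
            best_order, best_len = order, length
    # Arc between neighbors is proportional to their distance in the original space.
    gaps = np.array([D[best_order[i], best_order[(i + 1) % K]] for i in range(K)])
    arcs = 2 * np.pi * gaps / gaps.sum()
    angles = np.concatenate([[0.0], np.cumsum(arcs)[:-1]])
    return list(best_order), angles

# Four illustrative archetypes in two dimensions; columns 1 and 2 are close together.
Z_demo = np.array([[0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.2, 1.0]])
order, angles = cyclic_order_and_angles(Z_demo)
```

With these toy archetypes, the two nearby columns become neighbors on the circle and are separated by the smallest arc, mirroring the effect described for A2 and A3.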

Another problem of the simplex visualization with more than three archetypes is its *non-uniqueness*: two projections \(\mathbf{h }_{n1}\) and \(\mathbf{h }_{n2}\) can be close to each other even though they are *composed* of different archetypes. This goes against one’s intuition in judging which archetypes the observations belong to. Consider, for example, the observations inside the dashed circle in Fig. 8k: we get the idea that these observations are basically composed of A1, A2, A5, and/or A4, but we do not get the exact compositions. We therefore propose to show ‘whiskers’, which point in the direction of the composing archetypes; Fig. 8l shows the corresponding visualization. We can now easily see the composition of the observations inside the dashed circle: the observations on the right side of the line are composed of A1 and A4; the observations on the left side of the line are composed of A1, A2, A5; and the left-most observation is composed of A1, A5. We vary the length of the ‘whiskers’ according to the coefficients \(\mathbf{h }_n\); the longer the whisker, the closer the observation is to the archetype the whisker points toward.
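One way to compute such whiskers is sketched below: for each projected observation, a short line segment is drawn toward every archetype with a non-negligible coefficient, with length proportional to that coefficient. The function name, the equidistant vertex layout, and the parameters `scale` and `min_weight` are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def whisker_segments(H, scale=0.2, min_weight=0.05):
    """Return (n, k, start, end) tuples describing whiskers for column stochastic H (K x N)."""
    K, N = H.shape
    angles = 2 * np.pi * np.arange(K) / K
    V = np.stack([np.cos(angles), np.sin(angles)])   # vertex positions on the unit circle
    Y = V @ H                                        # projected observations
    segments = []
    for n in range(N):
        for k in range(K):
            if H[k, n] < min_weight:                 # skip archetypes that barely contribute
                continue
            direction = V[:, k] - Y[:, n]
            norm = np.linalg.norm(direction)
            if norm == 0.0:                          # observation sits exactly on the vertex
                continue
            end = Y[:, n] + scale * H[k, n] * direction / norm
            segments.append((n, k, Y[:, n].copy(), end))
    return segments

# Two observations that project to the same point but have different compositions.
H_demo = np.array([[0.5, 0.0],
                   [0.0, 0.5],
                   [0.5, 0.0],
                   [0.0, 0.5]])
segments = whisker_segments(H_demo)
```

Although both observations in `H_demo` project to the center of the circle, their whiskers point toward different pairs of vertices, which makes the two compositions distinguishable.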

## 7 Applications

### 7.1 Gaussian observations: archetypal soccer players

We use a data set already explored by Eugster (2012) for archetypal analysis, to evaluate the solution provided by maximizing (6). We analyze soccer players playing in four European top leagues. The extracted data set consists of \(M = 25\) skills of \(N = 1658\) players (all positions—Defender, Midfielder, Forward—except Goalkeepers) from the German Bundesliga, the English Premier League, the Italian Serie A, and the Spanish La Liga. The skills are rated from 0 to 100 and describe different abilities of the players: physical abilities like balance, stamina, and top speed; ball skills like dribble, pass, and shot accuracy and speed; and general skills like attack and defence performance, technique, aggression, and teamwork. Note that Eugster (2012) assumes that the differences are interpretable, i.e., the ratings are on a ratio scale. We compute \(K = 4\) archetypes, as in Eugster (2012).

### 7.2 Multinomial observations: NIPS bag-of-words

*k*-means prototype. But, to the best of our knowledge a “Bayesian Paradigm” archetypal document can make sense in a NIPS corpus.

### 7.3 Bernoulli observations: Austrian national guest survey

Table 2: The six archetypal profiles for the winter tourists example

| Activity | A1 | A3 | A5 | A4 | A6 | A2 |
|---|---|---|---|---|---|---|
| Alpine Ski | 1.00 (1) | 1.00 (1) | 1.00 (1) | 1.00 (1) | 0.00 (0) | 0.00 (0) |
| Tour Ski | 0.41 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) |
| Snowboard | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.59 (1) | 0.00 (0) | 0.00 (0) |
| Cross Country | 0.75 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) |
| Ice Skating | 0.60 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) |
| Sledge | 1.00 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) |
| Tennis | 0.15 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.20 (0) |
| Riding | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.08 (0) |
| Pool Sauna | 0.96 (1) | 0.00 (0) | 0.37 (0) | 1.00 (1) | 0.82 (1) | 0.11 (0) |
| Spa | 0.22 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.79 (1) | 0.00 (0) |
| Hiking | 0.95 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 1.00 (1) | 0.18 (0) |
| Walk | 1.00 (1) | 0.00 (0) | 1.00 (1) | 0.00 (0) | 1.00 (1) | 1.00 (1) |
| Excursion (org) | 0.29 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.41 (1) |
| Excursion (ind) | 0.81 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.94 (1) | 1.00 (1) |
| Relax | 0.99 (1) | 0.00 (0) | 1.00 (1) | 1.00 (1) | 1.00 (1) | 0.81 (0) |
| Dinner | 0.82 (1) | 0.53 (0) | 0.86 (1) | 0.02 (0) | 0.00 (0) | 1.00 (1) |
| Shopping | 1.00 (1) | 0.00 (0) | 1.00 (1) | 0.01 (0) | 0.33 (0) | 1.00 (1) |
| Concert | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.29 (1) |
| Sightseeing | 0.66 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.66 (1) | 1.00 (1) |
| Heimat | 0.58 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) |
| Museum | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 1.00 (1) |
| Theater | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.30 (1) |
| Heurigen | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.45 (0) |
| Local event | 0.99 (1) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.10 (0) |
| Disco | 0.68 (1) | 0.00 (0) | 0.00 (0) | 1.00 (1) | 0.00 (0) | 0.08 (0) |
| Interpretation of the archetypes | Maximal | Minimal | Basic, traditional | Basic, modern | Non-sportive, wellness | Non-sportive, cultural |

Here we present the six archetypes solution. Table 2 lists the archetypal profiles (i.e., the probability of positive response) and, in parentheses, the corresponding archetypal observations (with maximum *w* value). Archetype A1 is the maximal winter tourist, who is engaged in nearly every sportive and wellness activity with high probability. Archetype A3, on the other hand, is the minimal winter tourist, who is only engaged in alpine skiing and having dinner. Both archetypes A5 and A4 are engaged in the basic Austrian winter activities (alpine skiing, indoor swimming, and relaxing). In addition, A5 is engaged in traditional activities (dinner and shopping), whereas A4 is engaged in more modern activities (snowboarding and going to a disco). Finally, A6 and A2 are the non-sportive archetypes: A6 is engaged in wellness activities and A2 in cultural activities. Note that important engagements of the archetypes are missed if one only looks at the archetypal observations rather than the archetypal profiles; e.g., the possible engagement of A2 in hiking. We can now utilize the factors \(\mathbf{H }\) for each of the tourists to learn their relations to the archetypal winter tourist profiles. This allows us, for example, to target very specific advertising material to tourists for the next winter season.
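The difference between an archetypal profile and an archetypal observation can be made concrete with a small sketch. It assumes a binary data matrix `X` of shape (activities, tourists) and a weight matrix `W` of shape (tourists, archetypes) whose columns are probability vectors; the toy numbers and function names are made up for illustration and are not the survey data.

```python
import numpy as np

def archetypal_profiles(X, W):
    """Profiles as convex combinations of observations: X W, shape (M, K)."""
    return X @ W

def archetypal_observations(X, W):
    """For each archetype, pick the observation with the largest weight in W."""
    idx = np.argmax(W, axis=0)   # one observation index per archetype
    return X[:, idx], idx

# Toy data (assumed): M = 3 activities, N = 4 tourists, K = 2 archetypes.
X = np.array([[1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)
W = np.array([[0.6, 0.0],
              [0.4, 0.0],
              [0.0, 0.7],
              [0.0, 0.3]])

Z = archetypal_profiles(X, W)             # probabilities of positive response
A, idx = archetypal_observations(X, W)    # binary responses of chosen tourists
```

In this toy case the first profile assigns probability 0.4 to the second activity, whereas the corresponding archetypal observation records a 0 for it. This is the kind of engagement that is missed when only the archetypal observations are inspected.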

For comparison, we also compute the six archetypes solution with the original archetypal analysis algorithm (details and results can be found online). The classical algorithm finds archetypes with a similar interpretation. However, we observe that the probabilistic archetypal profiles are “more extreme” and therefore easier to interpret.

### 7.4 Poisson observations: disasters worldwide from 1900–2008

We present the seven archetypes solution; Figure 13 shows a summary. There are two minimal profiles, A1 and A7, with small differences in the categories extreme temperature/flood and storm. A1 can be considered the archetypal profile of a safe country; the corresponding archetypal observations include Malta and the Cayman Islands (other close observations are the Nordic countries). Archetype A5 is the maximal archetypal profile with counts in every category, and the corresponding archetypal observations include China and the United States. This can be expected from the size and population of the countries: China (third in size and first in population) and the USA (fourth and third). Other countries with high factors \(\mathbf{H }\) for this archetypal profile are India (seventh and second) and Russia (first and ninth). A3 and A4 are the archetypes that are affected by drought and epidemic, where A3 additionally has a high insect infestation count. A2 is the archetype that is susceptible only to complex disasters (where neither nature nor human is the definitive cause), whereas A6 has high counts in the categories earthquake, flood, mass movement wet, and volcano. Here the archetypal countries include Indonesia and Colombia.

For comparison, we also compute the seven archetypes solution with the original archetypal analysis algorithm (details and results can be found online). Both solutions are similar. However, we observe that the probabilistic solution is more robust towards outliers. This can be observed from the two simplex visualizations (available online). Both solutions move towards the outliers, but the classical solution is more sensitive to them than the probabilistic solution (e.g., PAA6 vs. AA6 and PAA5 vs. AA7). The robustness provides more freedom to explain the other observations and increases the interpretability.

## 8 Discussion

Archetypal analysis expresses observations as compositions of extreme values, or archetypes. Archetypes can be thought of as ideal or pure characteristics, and the goal of archetypal analysis is to find these characteristics and to explain the available observations as combinations of them. The standard formulation of archetypal analysis was suggested by Cutler and Breiman and is based on finding the approximate convex hull of the observations. Over the last decade this approach has been used extensively by researchers, but its applications have mostly been limited to real–valued observations. In this paper, we have proposed a probabilistic formulation of archetypal analysis, which enjoys several crucial advantages over the geometric approach, including but not limited to the extension to other observation models: Bernoulli, Poisson, and multinomial. We have achieved this by approximating the convex hull in the parameter space under a suitable observation model. Our contribution lies in formally extending the standard AA framework, suggesting efficient optimization tools based on the majorization–minimization method, and demonstrating the applicability of such approaches in practical applications. We have also suggested improvements of the standard simplex visualization tool to better show the intricacies of the archetypal analysis solution.

The probabilistic framework provides further advantages that remain to be explored in their entirety. For example, it provides a theoretically sound approach for choosing the number of archetypes. This can be done by imposing appropriate priors over the \(\mathbf{W }\) and \(\mathbf{H }\) matrices, such as a symmetric Dirichlet distribution with coefficient \(< 1\). The prior can be used to effectively shrink and expand the convex hull to fit the observations. Since the Dirichlet distribution is a natural prior for the multinomial distribution, this solution can be approximated relatively easily using a variational Bayes approach, and initial results show that this is indeed an effective approach for choosing the number of archetypes. However, this becomes a slightly trickier problem when applied to other observation models, such as normal and Poisson. We are currently working on suitable methods to solve the related optimization problems efficiently. It is worth mentioning that the convergence of the suggested algorithms is usually slower than that of standard factor models due to the additional constraint imposed on the loading matrix through \(\mathbf{W }\). Additionally, multiplicative updates themselves can be slow, and therefore faster algorithms should be investigated to scale up probabilistic archetypal analysis to larger datasets.
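To give some intuition for why a symmetric Dirichlet prior with coefficient below one is a plausible choice here, the following sketch evaluates its log-density at a uniform and at a nearly sparse probability vector. The function name and the example vectors are assumptions made for illustration; this is not the variational Bayes procedure mentioned above.

```python
import math

def symmetric_dirichlet_logpdf(x, alpha):
    """Log-density of a symmetric Dirichlet(alpha) at probability vector x."""
    k = len(x)
    log_norm = math.lgamma(k * alpha) - k * math.lgamma(alpha)
    return log_norm + (alpha - 1.0) * sum(math.log(v) for v in x)

uniform = [0.25, 0.25, 0.25, 0.25]       # mass spread over all archetypes
near_sparse = [0.97, 0.01, 0.01, 0.01]   # mass concentrated on one archetype

for alpha in (0.5, 1.0, 2.0):
    print(alpha,
          symmetric_dirichlet_logpdf(uniform, alpha),
          symmetric_dirichlet_logpdf(near_sparse, alpha))
```

With a coefficient below one the nearly sparse vector receives the higher density, with a coefficient above one the uniform vector does, and at exactly one the density is flat. A prior of the first kind therefore favors coefficient vectors that use few components, which is one way to read its role in selecting the number of archetypes.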

Another potential extension of the probabilistic framework is to tackle ordinal or Likert scale variables. Since ordinal variables lack additivity, they must be addressed through a probabilistic set-up with a suitable observation model. Given that survey data are often on a Likert scale, archetypal analysis of such observations can have a large impact on social science and marketing. For example, one could describe the personality of consumers in terms of the personality of the most “extreme”, i.e., archetypal, consumers (using, e.g., the Likert scaled items defined by the Big Five Inventory). We believe that these suggested improvements will make archetypal analysis more robust and accessible to non-scientific users.

## Notes

### Acknowledgments

The calculations presented above were performed using computer resources within the Aalto University School of Science “Science-IT” project.

### References

- Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml.
- Bauckhage, C., & Thurau, C. (2009). Making archetypal analysis practical. In *Pattern recognition, lecture notes in computer science, vol. 5748*, Springer, Berlin Heidelberg, pp. 272–281. doi:10.1007/978-3-642-03798-6_28.
- Bhattacharya, A., & Dunson, D. B. (2012). Simplex factor models for multivariate unordered categorical data. *Journal of the American Statistical Association*, *107*(497), 362–377. doi:10.1080/01621459.2011.646934.
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. *Journal of Machine Learning Research*, *3*, 993–1022.
- Chan, B. H. P., Mitchell, D. A., & Cram, L. E. (2003). Archetypal analysis of galaxy spectra. *Monthly Notices of the Royal Astronomical Society*, *338*(3), 790–795. doi:10.1046/j.1365-8711.2003.06099.x.
- Cutler, A., & Stone, E. (1997). Moving archetypes. *Physica D: Nonlinear Phenomena*, *107*(1), 1–16. doi:10.1016/S0167-2789(97)84209-1, http://www.sciencedirect.com/science/article/pii/S0167278997842091.
- Cutler, A., & Breiman, L. (1994). Archetypal analysis. *Technometrics*, *36*(4), 338–347.
- Davis, T., & Love, B. C. (2010). Memory for category information is idealized through contrast with competing options. *Psychological Science*, *21*(2), 234–242. doi:10.1177/0956797609357712.
- Ding, C., Li, T., Peng, W., & Park, H. (2006). Orthogonal nonnegative matrix tri-factorizations for clustering. In *Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining*, pp. 126–135. doi:10.1145/1150402.1150420.
- Ding, C. H. Q., Li, T., & Jordan, M. I. (2010). Convex and semi-nonnegative matrix factorizations. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *32*(1), 45–55.
- Dolnicar, S., & Leisch, F. (2004). Segmenting markets by bagged clustering. *Australasian Marketing Journal*, *12*(1), 51–65.
- do Nascimento, J. M. P., & Dias, J. M. B. (2005). Vertex component analysis: A fast algorithm to unmix hyperspectral data. *IEEE Transactions on Geoscience and Remote Sensing*, *43*(4), 898–910. doi:10.1109/TGRS.2005.844293.
- Dolnicar, S., Grün, B., & Leisch, F. (2011). Quick, simple and reliable: Forced binary survey questions. *International Journal of Market Research*, *53*(2), 231–252. doi:10.2501/IJMR-53-2-231-252.
- EM-DAT (2013). The OFDA/CRED international disaster database. Universite catholique de Louvain, Brussels, Belgium; http://www.emdat.net.
- Eugster, M. J. A., & Leisch, F. (2013). archetypes: Archetypal analysis. http://CRAN.R-project.org/package=archetypes, R package version 2.1-2.
- Eugster, M. J. A., & Leisch, F. (2011). Weighted and robust archetypal analysis. *Computational Statistics and Data Analysis*, *55*(3), 1215–1225. doi:10.1016/j.csda.2010.10.017.
- Eugster, M. J. A. (2012). Performance profiles based on archetypal athletes. *International Journal of Performance Analysis in Sport*, *12*(1), 166–187.
- Févotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the beta-divergence. *Neural Computation*, *23*(9), 2421–2456.
- Friendly, M. (2000). *Visualizing categorical data*. Cary, NC: SAS Institute.
- Hahsler, M., & Hornik, K. (2007). TSP—infrastructure for the traveling salesperson problem. *Journal of Statistical Software*, *23*(2), 1–21. http://www.jstatsoft.org/v23/i02/.
- Hofmann, T. (1999). Probabilistic latent semantic indexing. In *Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval*. ACM, New York, NY, USA, SIGIR ’99, pp. 50–57. doi:10.1145/312624.312649.
- Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. In *Advances in neural information processing systems, vol. 13*, pp. 556–562.
- Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. *Nature*, *401*(6755), 788–791. doi:10.1038/44565.
- Li, S., Louviere, J., Carson, R., & Wang, P. (2003). Archetypal analysis: A new way to segment markets based on extreme individuals. In *A celebration of Ehrenberg and Bass: Marketing knowledge, discoveries and contribution. Proceedings of the ANZMAC 2003 conference*. http://epress.lib.uts.edu.au/research/handle/10453/2183.
- Marinetti, S., Finesso, L., & Marsilio, E. (2007). Archetypes and principal components of an IR image sequence. *Infrared Physics & Technology*, *49*(3), 272–276. doi:10.1016/j.infrared.2006.06.017, http://www.sciencedirect.com/science/article/pii/S1350449506000910.
- Mohamed, S., Heller, K. A., & Ghahramani, Z. (2009). Bayesian exponential family PCA. In *Advances in neural information processing systems, vol. 21*, pp. 1089–1096.
- Mørup, M., & Hansen, L. K. (2012). Archetypal analysis for machine learning and data mining. *Neurocomputing*, *80*, 54–63. doi:10.1016/j.neucom.2011.06.033.
- Porzio, G. C., Ragozini, G., & Vistocco, D. (2008). On the use of archetypes as benchmarks. *Applied Stochastic Models in Business and Industry*, *24*(5), 419–437. doi:10.1002/asmb.727, http://onlinelibrary.wiley.com/doi/10.1002/asmb.727/abstract.
- Seiler, C., & Wohlrabe, K. (2013). Archetypal scientists. *Journal of Informetrics*, *7*(2), 345–356. doi:10.1016/j.joi.2012.11.013.
- Sifa, R., & Bauckhage, C. (2013). Archetypical motion: Supervised game behavior learning with archetypal analysis. In *2013 IEEE conference on computational intelligence in games (CIG)*, pp. 1–8. doi:10.1109/CIG.2013.6633609.
- Steinley, D. (2006). K-means clustering: A half-century synthesis. *British Journal of Mathematical and Statistical Psychology*, *59*(1), 1–34. doi:10.1348/000711005X48266.
- Stone, E., & Cutler, A. (1996). Archetypal analysis of spatio-temporal dynamics. *Physica D: Nonlinear Phenomena*, *90*(3), 209–224. doi:10.1016/0167-2789(95)00244-8.
- Thøgersen, J. C., Mørup, M., Damkiær, S., Molin, S., & Jelsbak, L. (2013). Archetypal analysis of diverse Pseudomonas aeruginosa transcriptomes reveals adaptation in cystic fibrosis airways. *BMC Bioinformatics*, *14*(1), 279. doi:10.1186/1471-2105-14-279, http://www.biomedcentral.com/1471-2105/14/279/abstract.
- Thurau, C., Kersting, K., & Bauckhage, C. (2009). Convex non-negative matrix factorization in the wild. In *Ninth IEEE international conference on data mining, 2009. ICDM ’09*, pp. 523–532. doi:10.1109/ICDM.2009.55.
- Thurau, C., Kersting, K., & Bauckhage, C. (2010). Yes we can: Simplex volume maximization for descriptive web-scale matrix factorization. In *Proceedings of the 19th ACM international conference on information and knowledge management*, ACM, New York, NY, USA, CIKM ’10, pp. 1785–1788. doi:10.1145/1871437.1871729.
- Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. *Journal of Computational and Graphical Statistics*, *14*, 511–528.
- Woodbury, M. A., & Clive, J. (1974). Clinical pure types as a fuzzy partition. *Journal of Cybernetics*, *4*(3), 111–121. doi:10.1080/01969727408621685.
- Xiong, Y., Liu, W., Zhao, D., & Tang, X. (2013). Face recognition via archetype hull ranking. In *2013 IEEE international conference on computer vision (ICCV)*, pp. 585–592. doi:10.1109/ICCV.2013.78.
- Yang, Z., & Oja, E. (2012). Clustering by low-rank doubly stochastic matrix decomposition. arXiv:1206.4676, http://arxiv.org/abs/1206.4676.