1 Introduction

Shape analysis is an extensive area of research since many applications require quantifying the difference between shapes, registering shapes, seeing how one shape deforms into another, calculating the mean shape, or classifying and clustering different shapes. In the case of planar shapes, one way to approach their analysis is to consider landmarks on the boundary curve of the object of interest to characterize its shape [1]. However, choosing the appropriate landmarks for each shape and registering these points between different shapes is not always an easy task; therefore, the function that defines the boundary curve of the object can be used instead to characterize its shape. Then, every plane shape is characterized by a parameterized curve, and the set of all plane shapes is a Riemannian manifold with the appropriate metric; see for instance [2]. Although a certain parameterization of the boundary curve can be fixed for each plane shape (for example, parameterization by arc length), in elastic shape analysis the metric must be preserved under reparameterizations of the curve. In [3], the authors use the square-root velocity function (SRVF) and propose an elastic metric for the study of shapes represented by curves, which is valid not only for plane curves but also for curves in \(n\)-dimensional Euclidean spaces. Depending on the application, it may be interesting for the metric to be invariant with respect to translations, rotations, and scaling as well; in this case, the feature space is called the shape space. However, in many applications (for example, in the analysis of anatomical structures in medicine), it is also important to distinguish the size of curves, in addition to their shape. In this second case, the feature space is called the shape and size space. Elastic metrics for the shape space and the shape and size space can be found in [4]. Furthermore, in [5], the authors propose a metric that distinguishes between the contribution of the difference in shape and the difference in size of two elements in the shape and size space.

On the other hand, elastic shape analysis of curves has been used in different data science problems, such as principal component analysis (PCA) [4, 6, 7], cluster analysis (CLA) [4, 8, 9], classification [4, 9, 10], and outlier detection [5, 6, 11, 12]. However, until now, archetypal analysis had not been used in elastic shape analysis.

Archetypal analysis (AA) was defined by [13]. It is an exploratory unsupervised statistical learning technique [14] that lies between two well-known techniques, PCA and CLA [15]. The objective of AA is to express the data as a mixture (convex combination) of a set of archetypes, where the archetypes themselves are mixtures of data points. Both facts make the results of AA easy to understand, even for non-experts: the interpretation of mixtures is simple, unlike the linear combinations of variables that form the factors of PCA. Although the centers of CLA are also easy to understand, its modeling flexibility is diminished by the fact that, unlike in AA, each data point can only be assigned to one cluster. Another fact favors human comprehension of the results of AA versus other unsupervised techniques: archetypes are extreme or pure profiles, and human beings understand them better than central points since we interpret opposite components better [16, 17]. Archetypes in Statistics have the same meaning as in everyday life [18].

Vinué [19] defined archetypoid analysis (ADA), a variant of AA in which the archetypal profiles are concrete observations. This is very useful in our case since, as explained later, we are not dealing with multivariate data but with curves.

Visualization is an important task in exploratory analysis since it allows us to discover hidden aspects of the data. In our case, this is even more important since curves are complex data [11]. ADA not only allows us to display the main features of the data set through its extremes (archetypoids) but also allows us to explore the data and extract information by approximating each observation as a mixture of the archetypoids. ADA makes it possible to see and describe the whole data set through only a small set of representative observations that are easy to understand [18]. Note that considering extreme curves to display the main characteristics of a data set of curves was proposed by [20], based on extreme principal component scores. However, unlike ADA, the goal of PCA is not to find extreme cases; therefore, the cases with extreme PCA scores do not necessarily correspond to archetypal cases [13]. Even if all PCs were taken into account, archetypes could not be recovered [21].

AA and ADA are applied in many diverse fields, such as biology [22], computer vision [23,24,25,26,27,28], education [29, 30], engineering [31], genetics [32], machine learning problems [15], market research [17], neuroscience [33,34,35], psychology [36], and sports [37, 38]. Since the proposal by [21], AA has become a standard in the accommodation problem in industrial design [39], where extreme cases are sought to give designers an efficient way to develop and assess a product design. The designer considers a small set of boundary cases so that, if the design fits well for those cases, it will also fit well for the less extreme cases. However, the accommodation problem has not been restricted to the multivariate case; it has also been addressed in other settings, such as shapes with landmarks [40, 41].

In this paper, we consider ADA with the metrics defined in [4] and [5] for the shape and size space. This is the first time that ADA has been used in elastic shape analysis. ADA has been used before in shape analysis, but with landmarks [40, 41], working in the tangent space. Our motivating problem is an accommodation problem with curves; therefore, we need to find archetypal curves. The main contributions of this work are as follows: we propose the first methodology for obtaining archetypal curves in the shape and size space, analyze its use with simulated data, and apply it to a real problem. Furthermore, the code is made available.

The outline of the paper is as follows. Sect. 2 reviews the SRVF representation of curves and the elastic metrics, as well as several multivariate statistical methods: a multidimensional scaling procedure and AA and ADA for the multivariate case. The proposed methodology for finding archetypal curves is introduced in Sect. 3. Sections 4 and 5 discuss the results when the new methodology is applied to a simulated data set and a real data set, respectively. Finally, some conclusions are given in Sect. 6.

2 Background

2.1 Elastic Metrics

For the study of shapes, or shapes and sizes, described by curves, each curve is usually represented as a point in a specific metric space. Since a metric space is endowed with a distance function, comparing two shapes, or two shapes and sizes, of domains bounded by their respective curves amounts to computing the distance between the two points that represent those curves.

The classical metric spaces to represent curves are subspaces of the Hilbert space of functions \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\), i.e., the space of square integrable functions from \([0,1]\subset {\mathbb {R}}\) to \({\mathbb {R}}^n\). For example, in the square-root velocity function (SRVF) approach, every parameterized curve \(\beta :[0,1]\rightarrow {\mathbb {R}}^n\) is represented by

$$\begin{aligned} \beta (t)\mapsto q(t)=\frac{\beta '(t)}{\sqrt{\vert \beta '(t)\vert }}. \end{aligned}$$

It is easy to check that q(t) associated with the curve \(\beta :[0,1]\rightarrow {\mathbb {R}}^n\) belongs to \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) because

$$\begin{aligned} \Vert q\Vert ^2_{{\mathbb {L}}^2}=\int _0^1 \vert q(t)\vert ^2dt=\textrm{length}(\beta )<\infty . \end{aligned}$$

Moreover, it is interesting to remark here that the space of curves in \({\mathbb {R}}^n\) and the space of functions \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) can actually be identified because not only is each curve represented in the space \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) but every function \(q(t)\in {\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) can be associated with the following curve

$$\begin{aligned} \beta (t)=\int _0^tq(s)\vert q(s)\vert ds. \end{aligned}$$

By using this identification between the space of parameterized curves and the Hilbert space \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\), we can use the distance \(\textrm{d}_{{\mathbb {L}}^2}\) in \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) as a distance between curves. Remember that, given two points \(q_1\) and \(q_2\) in \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\), since \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) is a Hilbert space, we can use its inner product to obtain the distance from \(q_1\) to \(q_2\) as

$$\begin{aligned} \begin{aligned} \textrm{d}_{{\mathbb {L}}^2}(q_1,q_2):=&\Vert q_2-q_1\Vert =\sqrt{\langle q_2-q_1,q_2-q_1\rangle _{{\mathbb {L}}^2} }\\ =&\sqrt{\int _0^1 \vert q_2(t)-q_1(t)\vert ^2dt}. \end{aligned} \end{aligned}$$
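For concreteness, both ingredients are easy to code numerically. The following is a minimal R sketch under the assumption of curves sampled on a common uniform grid; `srvf` and `d_L2` are our own helper names, not the implementation of [4].

```r
# Minimal numerical SRVF sketch: beta is an n_pts x n matrix of curve samples
# on the grid t, and q(t) = beta'(t) / sqrt(|beta'(t)|)
srvf <- function(beta, t) {
  dbeta <- apply(beta, 2, function(col) diff(col) / diff(t))  # finite differences
  speed <- sqrt(rowSums(dbeta^2))                             # |beta'(t)|
  dbeta / sqrt(pmax(speed, .Machine$double.eps))              # avoid dividing by 0
}

# L2 distance between two SRVFs sampled on the same uniform grid t
d_L2 <- function(q1, q2, t) {
  sqrt(sum(rowSums((q2 - q1)^2)) * mean(diff(t)))             # Riemann sum of |q2 - q1|^2
}
```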

When we think of a curve as a geometric object embedded or immersed in \({\mathbb {R}}^n\), we must conclude that the previous distance is not good enough to characterize the space of curves, mainly because it is not invariant under reparameterizations. Let us briefly recall the concept of reparameterization. Given a curve \(\gamma :[0,1]\rightarrow {\mathbb {R}}^n\) and a diffeomorphism \(\phi :[0,1]\rightarrow [0,1]\) such that \(\phi (0)=0\) and \(\phi (1)=1\), we will say that the curve

$$\begin{aligned} \beta :[0,1]\rightarrow {\mathbb {R}}^n,\quad t\mapsto \beta (t):=\gamma (\phi (t)) \end{aligned}$$

is the reparameterization by \(\phi \) of the curve \(\gamma \). The set of reparameterizations with the composition law has a group structure, and we will denote it by \(\Gamma \). A required condition for a distance to be useful for comparing curves is, therefore, to be invariant under the action of the group of reparameterizations. Such invariant metrics are called elastic metrics.

By using the identification of the space of curves with \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\), the action of the group \(\Gamma \) on the space of curves naturally induces an action on \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) given by

$$\begin{aligned} \Gamma \times {\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\rightarrow {\mathbb {L}}^2([0,1],{\mathbb {R}}^n),\quad (\phi ,q)\mapsto \phi (q)(t):=q(\phi (t))\sqrt{\phi '(t)}. \end{aligned}$$

Since two rotated curves represent the same bounded shape, we have to take care of the group of rotations as well. The group of rotations SO(n) acts by matrix multiplication on the space of curves, and by again using the identification between the space of curves and \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\), there is a global action of the group \(SO(n)\times \Gamma \) on \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) given by

$$\begin{aligned} \begin{array}{rcl} SO(n)\times \Gamma \times {\mathbb {L}}^2([0,1],{\mathbb {R}}^n) &\rightarrow & {\mathbb {L}}^2([0,1],{\mathbb {R}}^n),\\ (O,\phi ,q) &\mapsto & (O,\phi )(q)(t):=O(q(\phi (t)))\sqrt{\phi '(t)}. \end{array} \end{aligned}$$

Then, the space of interest, the shape and size space, will be represented as the orbit space

$$\begin{aligned} {\mathcal {S}}:={\mathbb {L}}^2([0,1],{\mathbb {R}}^n)/\left( SO(n)\times \Gamma \right) . \end{aligned}$$

In order to simplify the notation, let us write \(G:=SO(n)\times \Gamma \). Then, given \(q\in {\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\), the associated orbit \([q]\in {\mathcal {S}}\) is obtained as

$$\begin{aligned}{}[q]:=G.q=\left\{ (O,\phi )(q)\,:\, (O,\phi )\in G\right\} . \end{aligned}$$

Classically, see [4], the distance between orbits has been used to define an elastic metric in \({\mathcal {S}}\) as

$$\begin{aligned} \begin{aligned} \textrm{d}_2\left( [p],[q]\right) :=&\inf _{(O',\phi '),(O'',\phi '') \in G}\textrm{d}_{{\mathbb {L}}^2}\left( (O',\phi ')(p),(O'',\phi '')(q)\right) \\ =&\inf _{(O',\phi '),(O'',\phi '') \in G}\textrm{d}_{{\mathbb {L}}^2}\left( p,(O',\phi ')^{-1}(O'',\phi '')(q)\right) \\ =&\inf _{(O,\phi )\in G}\textrm{d}_{{\mathbb {L}}^2}\left( p,(O,\phi )(q)\right) , \end{aligned} \end{aligned}$$

where the second equality holds because \(G\) acts on \({\mathbb {L}}^2([0,1],{\mathbb {R}}^n)\) by isometries.

This is a distance in the shape and size space. In order to “take away” the length of the curves, normalized curves of length 1 can be used instead as

$$\begin{aligned} \textrm{d}_4\left( [p],[q]\right) :=\inf _{(O,\phi )\in G}\cos ^{-1}\left( \left\langle \frac{p}{\Vert p\Vert _{{\mathbb {L}}^2}},\, (O,\phi )\left( \frac{q}{\Vert q\Vert _{{\mathbb {L}}^2}}\right) \right\rangle _{{\mathbb {L}}^2}\right) . \end{aligned}$$

In this paper, we focus on distances in the shape and size space, such as \(\textrm{d}_2\). Recently, [5] proposed a new distance given by

$$\begin{aligned} \textrm{d}_{{4s}}([p],[q]):=\sqrt{\textrm{d}_4^2\left( \frac{p}{\Vert p\Vert _{{\mathbb {L}}^2}},\frac{q}{\Vert q\Vert _{{\mathbb {L}}^2}}\right) +\ln ^2\left( \frac{\Vert p\Vert _{{\mathbb {L}}^2}}{\Vert q\Vert _{{\mathbb {L}}^2}}\right) }. \end{aligned}$$

This new distance is scale invariant in the sense that for any \(\lambda \ne 0\)

$$\begin{aligned} \textrm{d}_{4s}([p],[q])=\textrm{d}_{4s}([\lambda p],[\lambda q]). \end{aligned}$$

Moreover, in some cases, for instance when dealing with a set of curves with a very large range of lengths, the new \(\textrm{d}_{4s}\) distance can improve on the \(\textrm{d}_2\) distance (see, e.g., Figures 6 and 7 of [5]). Both distances are compared in a clustering setting in [5].
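Given an implementation of \(\textrm{d}_4\), the distance \(\textrm{d}_{4s}\) is immediate to compute from the definition above. A minimal R sketch follows; `d4` stands for any implementation of \(\textrm{d}_4\) (e.g., from the code of [4]), and `l2norm` reuses the Riemann-sum approximation of the \({\mathbb {L}}^2\) norm — both are placeholder names.

```r
# d_4s from d_4 and the L2 norms of the SRVFs (d4 is assumed to be available)
l2norm <- function(q, t) sqrt(sum(rowSums(q^2)) * mean(diff(t)))

d4s <- function(p, q, t) {
  shape <- d4(p / l2norm(p, t), q / l2norm(q, t), t)  # shape contribution
  size  <- log(l2norm(p, t) / l2norm(q, t))           # size contribution
  sqrt(shape^2 + size^2)
}
```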

2.2 Multidimensional Scaling

Let \({\varvec{D}}\) be the \(m \times m\) matrix whose entry \(d_{pq}\) contains the observed dissimilarity from object p to object q.

Let us recall that a distance matrix \({\varvec{D}}\) is Euclidean if and only if \({\varvec{B}}\) is positive semidefinite [42, Theorem 14.2.1], where \({\varvec{B}} = ({\varvec{I}} - m^{-1}\varvec{ee}'){\varvec{M}} ({\varvec{I}} - m^{-1}\varvec{ee}')\), \({\varvec{M}}\) is the matrix with elements \(m_{pq} = -\frac{1}{2}d_{pq}^2\), \({\varvec{I}}\) is the \(m \times m\) identity matrix, and \({\varvec{e}}\) is the \(m \times 1\) vector with all its elements equal to one.
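This check is straightforward to implement; the following R sketch (with our own helper name `is_euclidean`) builds \({\varvec{B}}\) from \({\varvec{D}}\) and inspects its eigenvalues.

```r
# Check whether a distance matrix D is Euclidean via the eigenvalues of B
is_euclidean <- function(D, tol = 1e-8) {
  m <- nrow(D)
  J <- diag(m) - matrix(1, m, m) / m          # centering matrix I - ee'/m
  B <- J %*% (-0.5 * D^2) %*% J               # B = J M J with m_pq = -d_pq^2 / 2
  ev <- eigen(B, symmetric = TRUE, only.values = TRUE)$values
  all(ev > -tol)                              # TRUE iff B is positive semidefinite
}
```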

If the distances are Euclidean, they can be represented exactly in at most \(m - 1\) dimensions [42, Theorem 14.4.1] by means of classical multidimensional scaling (cMDS) [43]. The objective of cMDS is to return a set of points such that the distances between them are approximately equal to the original distances, since the dimension of the space into which the data are projected is usually less than \(m - 1\). On the other hand, if the distances are not Euclidean, we can use cMDS as an approximation, which is optimal for a certain discrepancy measure [42, Theorem 14.4.2]. However, it is also possible to use the h-plot, an alternative methodology proposed by [44, 45], which works even when the dissimilarity is not a distance. The aim of the h-plot is not to preserve the interpoint distances exactly but to preserve the relationships between dissimilarity variables. This perspective is especially useful when the distance is not Euclidean, as is the case for \(d_2\), \(d_4\), and \(d_{4s}\): these distances cannot be projected exactly into a Euclidean space, and there are negative eigenvalues in the respective \({\varvec{B}}\) matrices (see Sects. 4 and 5.1 for examples). Note that the original spaces are not Euclidean, so neither are the distances.

2.2.1 h-Plot for MDS

With the h-plot, \(\textbf{D}\) is treated as a data matrix, where each variable \(d_{q.}\) measures the distances from q to the other objects. (In the case of asymmetric relationships, the variables \(d_{.q}\), measuring the distances from the other objects to q, should also be considered.) The variance–covariance matrix \(\textbf{S}\) of \(\textbf{D}\) is computed; \(\textbf{S}\) is always positive semidefinite, and we solve its eigenvalue problem. Let \(\lambda _1\) and \(\lambda _2\) denote the two largest eigenvalues and \(q_1\) and \(q_2\) the corresponding unit eigenvectors. Then, the h-plot in two dimensions is \(H_2 = (\sqrt{\lambda _1} q_1, \sqrt{\lambda _2} q_2)\). It can be defined analogously for higher dimensions.

The Euclidean distance between the rows \(h_p\) and \(h_q\) is approximately the sample standard deviation of the difference between the variables \(d_{p.}\) and \(d_{q.}\). Therefore, if these variables are similar, their difference, and consequently the standard deviation of their difference, will be small, and they will be represented near each other, and vice versa. If the scale of the distances is linearly modified, the obtained configuration does not change; only the scale of the axes is modified. The goodness of fit can be easily assessed by \((\lambda _1^2 + \lambda _2^2)/\sum _j \lambda _j^2\), where a value close to 1 indicates a good fit. H-plots were compared with eleven methods by [44], with very good performance, even for asymmetric relationships [45].
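The construction takes only a few lines of R; this minimal sketch (`hplot2` is our own helper name) returns the two-dimensional configuration and the goodness-of-fit measure above.

```r
# Two-dimensional h-plot of a (symmetric) dissimilarity matrix D
hplot2 <- function(D) {
  S <- cov(D)                               # each column of D is a variable d_q.
  eig <- eigen(S, symmetric = TRUE)
  H2 <- cbind(sqrt(eig$values[1]) * eig$vectors[, 1],
              sqrt(eig$values[2]) * eig$vectors[, 2])
  gof <- sum(eig$values[1:2]^2) / sum(eig$values^2)  # goodness-of-fit measure
  list(H2 = H2, gof = gof)
}
```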

2.2.2 Congruence Coefficient

The best way to assess configurations is visual inspection [46, sec. 19.7]. However, we can use the congruence coefficient (CC), a correlation coefficient about the origin, to approximately assess the similarity of two configurations \(C_1\) and \(C_2\). In configuration \(C_1\) (\(C_2\)), the dissimilarity between the ith and jth objects is \(d_{ij}(C_1)\) (\(d_{ij}(C_2)\)).

CC is defined for symmetric dissimilarity matrices:

$$\begin{aligned} CC = \frac{\sum _{i<j}d_{ij}(C_1)d_{ij}(C_2)}{(\sum _{i<j}d_{ij}^2(C_1))^{1/2}(\sum _{i<j}d_{ij}^2(C_2))^{1/2}} \end{aligned}$$

CC ranges from 0 to 1. If \(C_1\) and \(C_2\) are perfectly similar geometrically, CC = 1. We say that two configurations are similar when they can be brought to a complete match by rigid motions and dilations.
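Computing CC from two symmetric dissimilarity matrices is direct; a minimal R sketch (with our own helper name `congruence`):

```r
# Congruence coefficient between two symmetric dissimilarity matrices
congruence <- function(D1, D2) {
  u <- D1[upper.tri(D1)]                    # off-diagonal dissimilarities, i < j
  v <- D2[upper.tri(D2)]
  sum(u * v) / sqrt(sum(u^2) * sum(v^2))
}
```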

In the experimental sections, the configuration \(C_1\) provided by the distances \(d_2\) and \(d_{4s}\) is compared with the configuration \(C_2\) obtained after projecting with the h-plot, using the Euclidean distance to compute the interpoint distances.

The idea of assessing the goodness of approximations by means of the correlation between distances has also been used elsewhere. For example, [47] used the correlation between the Procrustes distances and the Euclidean distances in the tangent space in shape statistics with landmarks.

2.3 Archetypal Analysis

Let us review AA and ADA in the multivariate case. Let \(\textbf{X}\) = (\(\textbf{x}_1\), ..., \(\textbf{x}_m\)) be an \(m \times r\) data matrix with m observations and r variables.

In AA, we search for k archetypes, which are mixtures of the observations, i.e., a \(k \times r\) matrix \(\textbf{Z}\) = (\(\textbf{z}_1\), ..., \(\textbf{z}_k\)) whose mixtures approximate each row \(\textbf{x}_i\):

$$\begin{aligned} \textbf{x}_i \sim \hat{\textbf{x}}_i = \sum _{j=1}^k \alpha _{ij} \textbf{z}_j. \end{aligned}$$
(1)

The \(m \times k\) matrix \(\mathbf {\alpha }\) = (\(\alpha _{ij}\)) contains the approximator mixture coefficients, while the \(k \times m\) matrix \(\mathbf {\beta }\) = (\(\beta _{jl}\)) contains the constructor mixture coefficients, i.e., they build the archetypes according to:

$$\begin{aligned} \textbf{z}_j = \displaystyle \sum _{l=1}^m \beta _{jl} \textbf{x}_l. \end{aligned}$$
(2)

To find the matrices \(\mathbf {\alpha }\) and \(\mathbf {\beta }\) and, therefore, \(\textbf{Z}\), we minimize the following residual sum of squares (RSS), where \(\Vert \cdot \Vert \) denotes the Frobenius norm for matrices and the Euclidean norm for vectors:

$$\begin{aligned} RSS = \displaystyle \Vert \mathbf{{X}} - \mathbf{{\alpha \beta X}} \Vert ^2 = \sum _{i=1}^m \left\| {\textbf{x}}_i - \sum _{j=1}^k \alpha _{ij} { \textbf{z}}_j\right\| ^2 = \sum _{i=1}^m \left\| {\textbf{x}}_i - \sum _{j=1}^k \alpha _{ij} \sum _{l=1}^m \beta _{jl} {\textbf{x}}_l\right\| ^2, \end{aligned}$$
(3)

under the restrictions

  1. \(\displaystyle \sum _{j=1}^k \alpha _{ij} = 1\) with \(\alpha _{ij} \ge 0\), \(i=1,\ldots ,m\), \(j=1,\ldots ,k\); and

  2. \(\displaystyle \sum _{l=1}^m \beta _{jl} = 1\) with \(\beta _{jl} \ge 0\), \(j=1,\ldots ,k\), \(l=1,\ldots ,m\).

In ADA, we search for k archetypoids, which are actual observations of the data set. Therefore, the minimization problem is similar to that of AA, but restriction 2 is replaced by:

  2. \(\displaystyle \sum _{l=1}^m \beta _{jl} = 1\) with \(\beta _{jl} \in \{0,1\}\), \(j=1,\ldots ,k\), \(l=1,\ldots ,m\). In this way, \(\beta _{jl}=1\) for one and only one l; otherwise, \(\beta _{jl}=0\).

In ADA, as in AA, each \(\alpha _{ij}\) returns the weight of the archetypoid \(\textbf{z}_j\) for the observation \(\textbf{x}_i\); that is to say, the \(\alpha \) approximator coefficients indicate how much each archetypoid contributes to the approximation of each observation.

Archetypes are located on the boundary of the convex hull of the data if \(k > 1\), while the single archetype is the mean if k = 1 [13]. Archetypoids are not necessarily on the boundary of the convex hull if \(k > 1\), and the single archetypoid is the medoid if k = 1 [19]. The medoid is the observation whose average dissimilarity to all the other observations is minimal; therefore, it is the most centrally located observation. Note that medoids are elements of the data set.

In this paper, we will focus on ADA, as explained in Sect. 3.
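As an illustration, AA for multivariate data is implemented in the R package archetypes [77]. The following minimal sketch fits k = 3 archetypes and then snaps each archetype to its nearest observation; the data set is arbitrary, and this nearest-observation step is only a naive surrogate for the BUILD/SWAP algorithm of ADA described in Sect. 3.1, not the algorithm itself.

```r
library(archetypes)

set.seed(1)
X <- as.matrix(iris[, 1:4])      # any m x r numeric data matrix
aa <- archetypes(X, k = 3)       # fit k = 3 archetypes
Z <- parameters(aa)              # k x r matrix of archetypes
alpha <- coef(aa)                # m x k matrix of alpha coefficients

# Naive archetypoid candidates: nearest observation to each archetype
nearest <- apply(Z, 1, function(z) which.min(colSums((t(X) - z)^2)))
archetypoid_candidates <- X[nearest, ]
```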

2.3.1 Toy Example

We use a two-dimensional data set to clarify the meaning of ADA and its differences from PCA and CLA. We consider two variables measured on 381 right feet of Spanish women: Foot Length (FL) and Ball Girth (BG). Details about the data set and the variables are provided in Sect. 5. PCA, k-means, and ADA with k = 3 are applied to the standardized data. Figure 1 shows the results.

Fig. 1 Results for FL and BG. a k-means, with cluster assignments and centroids represented by blue triangles. b ADA, with assignments to the maximum alpha and archetypoids represented by blue crosses. c PC-projected k-means results. d PC-projected ADA results

Archetypoids are extreme feet. The first archetypoid has very low FL and BG values; the second archetypoid is characterized by a very high value for BG but a medium value for FL; and archetypoid 3 is characterized by a very high FL value but a medium–high value for BG. The rest of the feet are explained by mixtures of these archetypal feet. For instance, a foot with values of 254 and 254 for FL and BG, respectively, is described as 64% of archetypoid 2 plus 36% of archetypoid 3. k-means does not return this kind of information; it only indicates the cluster assignments.

The centroids of k-means are in the middle of the data cloud, with less extreme dimensions than the archetypoids. Therefore, archetypoids are more easily interpretable. Centroids have more uniform shapes, with the same FL to BG ratio as the mean foot; this does not happen with archetypoids. We can visualize this in the PC projections. The first PC is a size component, while the second PC is a shape component, since the loadings are 0.7 and 0.7 for the 1st PC and 0.7 and -0.7 for the 2nd PC. Centroids lie on the zero horizontal line, while archetypoids are found near the border of the PC score space, although they are not the cases with the most extreme PC scores.

3 Methodology

In our problem, we do not have variables in \({\mathbb {R}}^r\); instead, we have the distances between the curves. When variables are unavailable, we can follow the strategy explained by [19] for finding archetypoids: project the distances into a certain space \({\mathbb {R}}^r\) and find the archetypoids in that space. Note that, since archetypoids are actual observations, we can determine the concrete curves in the original space and, in this way, also visualize the archetypal curves. This would not be possible with archetypes, since we cannot obtain a mixture of curves. Nevertheless, we still have the \(\alpha \) coefficients, which express the contribution of each archetypoid to each original curve. The idea of projecting onto an approximating linear space when the original space is not vectorial, and working in that space, is widely used in Statistics [48].

The scheme of the procedure is as follows.

1. Compute \(\textbf{D}\), the \(m \times m\) matrix whose entry \(d_{pq}\) is the distance between the curves [p] and [q].

2. Use a multidimensional scaling (MDS) method to find a representation in \({\mathbb {R}}^r\) that preserves, in some way, the pairwise distances, i.e., the information contained in \(\textbf{D}\). Depending on the method, a goodness-of-fit measure can be used to select r.

3. Calculate the archetypoids of the \(m \times r\) matrix \(\textbf{X}\) obtained by the MDS method; this matrix contains the coordinates of the points estimated to represent the distances.

Regarding \(\textbf{D}\), \(d_2\) or \(d_{4s}\) is used in our case. As regards the MDS method, we consider the h-plot in this work; it was previously used with good results in ADA when variables are not available [18, 19, 49, 50]. We want to emphasize that the scheme is flexible: other choices can be made for computing \(\textbf{D}\) and for the MDS step. A sketch of the whole pipeline is given below.
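The following R sketch strings the three steps together. `dist_curves` stands for any implementation of \(d_2\) or \(d_{4s}\), and `archetypoids_fit` for any ADA implementation (e.g., that of [40]); both names are placeholders of ours, not functions from the cited code.

```r
# End-to-end sketch: distance matrix -> h-plot projection -> archetypoids
find_archetypal_curves <- function(curves, k, r = 3) {
  m <- length(curves)
  D <- matrix(0, m, m)
  for (p in 1:(m - 1)) {                       # step 1: pairwise distances
    for (q in (p + 1):m) {
      D[p, q] <- D[q, p] <- dist_curves(curves[[p]], curves[[q]])
    }
  }
  eig <- eigen(cov(D), symmetric = TRUE)       # step 2: h-plot as MDS
  X <- eig$vectors[, 1:r] %*% diag(sqrt(eig$values[1:r]))
  archetypoids_fit(X, k)                       # step 3: ADA in R^r
}
```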

3.1 Computational Details

As regards the implementation of the methods, the code is available in Section Code Availability. For computing the \(\textrm{d}_2\) and \(\textrm{d}_{4s}\) distances, we use the implementations by [4] and [5], respectively; [5] also used the code by [4] for computing \(\textrm{d}_4\). To obtain the infimum over the orbits, two optimization methods are applied: Procrustes analysis and the dynamic programming algorithm. Full details are provided in [4, Appendix 2].

The h-plot implementation follows [44], using the princomp function for PCA in R.

To solve the mixed-integer optimization problem of ADA, [19] proposed an algorithm based on two phases: an initialization phase, called the BUILD phase, where a set of candidate archetypoids is selected, and the SWAP phase, where the initial set is improved by exchanging the selected observations for unselected ones and checking whether these replacements decrease the RSS. We use the R [51] implementation created by [40].

As regards the determination of the number of archetypoids, we use the elbow criterion, which has been used in previous papers, such as [13, 19, 52]. This criterion consists of displaying the RSS versus the number of archetypoids and determining the point where an elbow is found, as in the sketch below.
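A minimal sketch of the screeplot behind the elbow criterion; `ada_rss` is a hypothetical wrapper of ours that runs ADA for a given k on the MDS coordinates X and returns the resulting RSS.

```r
# Screeplot sketch for choosing the number of archetypoids k
ks <- 1:10
rss <- sapply(ks, function(k) ada_rss(X, k))
plot(ks, rss, type = "b",
     xlab = "Number of archetypoids k", ylab = "RSS")
```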

3.1.1 Scalability

Regarding scalability, there are two issues to consider. On the one hand, the number of sample points per curve is not a problem, since the algorithm by [4] resamples the curves at the outset to 100 points, so the number of sample points is always constant. Moreover, the number of sample points only affects the estimation of the distances.

On the other hand, let us analyze the scalability of our procedure when the number of curves is large, part by part. Firstly, the distances are computed pairwise, so this step can easily be parallelized. Secondly, the h-plot method depends on the solution of an eigenvalue problem for a positive semidefinite matrix, which is a well-studied problem for large matrices [53]; nowadays, there are even scalable methods for computing eigenvectors of non-symmetric matrices [54]. Thirdly, the ADA method was made scalable by [55].

4 Application to a Simulated Data Set

We have simulated an artificial data set of 90 3D cylindrical helixes, \(\beta _{i}(t)\), \(i=1,\cdots , 90\), \(t \in [0,1]\), with

$$\begin{aligned} x_{i}=a_i\cos (8\pi t);\quad y_{i}=a_i\sin (8\pi t);\quad z_{i}=b_i t; \quad i=1,\cdots , 90 \end{aligned}$$

where the parameters \(a_i\) and \(b_i\) of each helix are randomly drawn from two probability distributions, the radius \(a_i\sim Normal(50,20)\) and \(b_i\sim Uniform(30,70)\), so all these helixes have different shapes and different lengths. Figure 2 shows the parameter values obtained in the simulations for the 90 helixes. Figure 3a and b shows two graphical representations of the simulated helixes.
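The simulation itself takes a few lines of R; in the sketch below, the seed, the reading of the second Normal parameter as the standard deviation, and the number of sample points per curve are our assumptions, not choices specified in the paper.

```r
# Simulation sketch for the 90 helixes
set.seed(123)                            # assumed seed
m <- 90; n_pts <- 100                    # assumed number of sample points
a <- rnorm(m, mean = 50, sd = 20)        # radii a_i
b <- runif(m, min = 30, max = 70)        # heights b_i
t <- seq(0, 1, length.out = n_pts)
helixes <- lapply(1:m, function(i) {
  cbind(x = a[i] * cos(8 * pi * t),
        y = a[i] * sin(8 * pi * t),
        z = b[i] * t)
})
```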

Fig. 2 Parameters of the simulated helixes. a Values simulated for \(a_i, i=1,\cdots ,90\). b Values simulated for \(b_i, i=1,\cdots ,90\). c Scatterplot of the values simulated for \((a_i,b_i), i=1,\cdots ,90\)

Fig. 3 Simulated curves. a and b show the 90 simulated helixes seen from different perspectives

The distance matrices between the 90 curves, \(D_{4s}\) and \(D_2\), have been computed. As these distance matrices are not Euclidean (there are negative eigenvalues in \({\varvec{B}}\) for both distance matrices, see Fig. 4), h-plots can be used as the MDS step. The goodness-of-fit measures of the h-plots explained by [44] are \(87.94\%, 99.92\%\), and \(99.99\%\) for \(r=1,2,3\), respectively, for \(d_{4s}\), and \(80.31\%, 99.92\%\), and \(99.97\%\) for \(d_{2}\). We use \(r = 3\) in both cases. The resulting h-plots in two dimensions (r = 2) can be seen in Fig. 5; their CCs are 95% and 97%, respectively.

Fig. 4 3rd to 5th and 88th to 90th eigenvalues of \({\varvec{B}}\). a Using the distance \(d_{4s}\). b Using the distance \(d_2\)

Fig. 5 H-plots of the simulated curves. a Using the distance \(d_{4s}\). b Using the distance \(d_2\)

According to the elbow criterion, the screeplots shown in Fig. 6 advise us to consider three (k = 3) archetypoids in each case.

Fig. 6 Screeplot for the simulated data. a Using the distance \(d_{4s}\). b Using the distance \(d_2\)

The archetypoids obtained with the two distances are somewhat different (Figs. 7 and 8). The values of \(a_i\) in the helixes of the data set range from 4.82 to 121.57, and the values of \(b_i\) range between 30.18 and 79. Figure 8 shows the parameters of the 90 helixes together with the parameters of the archetypoids obtained with the two distances. The first archetypoid obtained from \(d_{4s}\) has a very low value for \(a_i\) and a very high value for \(b_i\), while the first archetypoid from \(d_2\) also has a low value for \(a_i\) and quite a high value for \(b_i\), slightly lower than that obtained with \(d_{4s}\). The second and third archetypoids from \(d_{4s}\) have low values for \(b_i\), and intermediate and high values for \(a_i\), respectively. The second and third archetypoids from \(d_2\) also have intermediate and high values for \(a_i\), respectively, but medium values for \(b_i\). In summary, the parameters of the archetypoids found by \(d_{4s}\) are more extreme than those of the archetypoids found by \(d_{2}\).

Fig. 7 Archetypoids obtained, seen from two different perspectives. In green, the archetypoids obtained with \(d_{4s}\); in red, those obtained with \(d_2\)

Fig. 8 Parameters of the archetypoid helixes obtained from the two distances. In green, the parameters \((a_i,b_i)\) of the archetypoids obtained with \(d_{4s}\); in red, the parameters \((a_i,b_i)\) of the archetypoids obtained with \(d_2\)

5 Application to a Real Data Set

Suitable footwear design needs to take into account the distribution of foot shapes [41]. Ignoring it will not only lead to lower sales but can also cause pain and deformity, especially in women. This is the reason why there are many studies on foot shape, such as [56,57,58,59,60,61].

Understanding the typology and distribution of body part shapes is critical not only in the apparel industry but also in ergonomic industrial design [62, 63], as well as in other scientific disciplines, including criminalistics [64]; face classification with all its fields of application (forensic anthropology, crime prevention, and human–machine interaction systems such as e-commerce, e-learning, games, dating, and social networks) [65, 66]; medicine [67,68,69]; phylogeny [70]; and sport [71,72,73]. Nor is it restricted to anthropometry: taxonomy is also important in morphometry in general, such as in plant or animal taxonomy [74, 75], as well as in genetics [76].

ADA with landmarks was used in [41] to determine foot types in the adult Spanish population. Here we carry out a similar study, but using curves instead of landmarks.

We use the data from [5], where the acquisition of these data is described in detail. Our curves consist of the longitudinal contour of right feet passing through the Ball Position. The sample size is 770, divided into 389 and 381 right feet of Spanish adult men and women, respectively.

The medoid shapes for men and women with \(d_{4s}\) are displayed in Fig. 9.

Fig. 9 Medoid curves of feet for men (a) and women (b) with \(d_{4s}\)

5.1 Results and Discussion

In the interests of brevity and as an illustrative example, we only examine the results for \(d_{4s}\), although the analysis could be carried out analogously with \(d_2\). The matrices \({\varvec{D}}\) obtained with \(d_{4s}\) for men and women are not Euclidean, since the respective \({\varvec{B}}\) matrices are not positive semidefinite: 41% of the eigenvalues are negative. Therefore, we show the results using the h-plot as the MDS step.

The goodness-of-fit measure for the h-plot (see [44] for details) is 99% for r = 4 for both men and women (it is 84.96%, 95.44%, 97.77%, and 99.43% for r = 1, 2, 3, and 4, respectively, for men, and 81.89%, 91.66%, 96.92%, and 98.83% for r = 1, 2, 3, and 4, respectively, for women). Therefore, we use r = 4. The CCs are 97% and 96% for men and women, respectively.

Figure 10 shows the screeplots for women and men. The elbow is found at k = 4 and k = 5 for women and men, respectively. The archetypal feet are displayed in Figs. 11 and 12 for women and men, respectively.

Fig. 10 Screeplot for women (a) and men (b)

Fig. 11 Archetypoidal feet for women

Fig. 12 Archetypoidal feet for men

In order to describe the archetypoidal curves obtained, Tables 1 and 2 display the percentiles of the four variables that most influence shoe fit according to footwear design experts. The variables are Foot Length, FL (distance between the rearmost and foremost points of the foot axis); Ball Girth, BG (perimeter of the ball section); Ball Width, BW (maximal distance between the extreme points of the ball section projected onto the ground plane); and Instep Height, IH (maximal height of the instep section, located at 50% of the foot length). We also show the percentiles of the variables after removing the scale, i.e., after dividing each of the variables by FL: BG/FL, BW/FL, and IH/FL.

Table 1 Percentiles of the main variables and divided by FL for archetypoidal feet of women
Table 2 Percentiles of the main variables and divided by FL for archetypoidal feet of men

In the case of women, the first archetypoidal foot has high percentiles for FL and IH and medium percentiles for BG and BW; the second archetypoidal foot has low percentiles for BG and IH, and medium percentiles for FL and BW; the third archetypoid has low percentiles for FL, BG, and BW, and a medium percentile for IH; and the fourth archetypoid has a low percentile for FL and medium percentiles for BG, BW, and IH. This last archetypoidal foot has an extreme shape, since it has very high percentiles (around 90) for the variables BG/FL, BW/FL, and IH/FL.

In the case of men, the first archetypoidal foot has a high percentile for FL and medium percentiles for the rest of the variables (BG, BW, and IH), although the percentiles are low for the variables divided by FL; the second archetypoidal foot has a high percentile for IH; the third archetypoid has high percentiles for BG, BW, and IH; the fourth archetypoid has low percentiles for BG, BW, and IH; and the fifth archetypoidal foot has high percentiles for all four variables FL, BG, BW, and IH.

Let us see how the feet are distributed according to the archetypoidal curves. Table 3 shows the distribution of archetypoidal profiles for women and men. For each foot, we consider its \(\alpha \) coefficients and assign the foot to the archetypoidal profile for which the \(\alpha \) coefficient is maximum. For example, the 4th archetypoidal profile for men is not very prevalent. A simplex visualization of the \(\alpha \) coefficients is shown in Fig. 13, obtained with the simplexplot function of the R package archetypes [77]. In this way, we have been able to visualize the set of foot curves and express them as mixtures of the archetypoids.

Table 3 Distribution of feet for women and men
Fig. 13 Simplexplot for women (a) and men (b)

We have applied multivariate ADA with the standardized features FL, BG, BW, and IH for men and women, considering the same number of archetypoids as with curves, in order to check whether the same results could have been achieved using the features directly instead of the curves. Table 4 shows the percentiles of the multivariate archetypoids for women and men.

The profiles returned by the multivariate features and by the curves are somewhat different. In the case of women, the fourth archetypoid profiles are the most similar, with no large differences in the features FL, BG/FL, BW/FL, and IH/FL. There are some coincidences in the first archetypoid profiles, with not too large differences in the features FL, BW, and IH. The same happens with the second archetypoid profiles, with not too large differences in the features BG, IH, BG/FL, and IH/FL. The third archetypoid profiles show the largest differences, which are found in all the features except IH.

Table 4 Percentiles of the main variables and divided by FL for archetypoidal feet of women and men with multivariate features

As regards men, the third, fourth, and fifth archetypoid profiles are the most similar, with no large differences in the features BG, BW, IH, BG/FL, BW/FL, and IH/FL; FL, BG, IH, BG/FL, and IH/FL; and FL, BG, BW, and IH/FL, respectively. The largest differences appear between the profiles of the first and second archetypoids, which coincide only in the feature BG for the first archetypoid and the features BG/FL and BW/FL for the second. Therefore, the archetypal profiles returned using the richer information of the curves cannot be retrieved using multivariate data. The same occurs in [41], where results with 3D landmarks and multivariate data were also compared.

6 Conclusion

We have used archetypal analysis for the first time in elastic shape analysis. We have applied ADA to the MDS projections of two distances (\(d_2\) and \(d_{4s}\)). ADA has allowed us to identify the archetypal curves and to relate the rest of the curves to those archetypoids through the \(\alpha \) coefficients. As curves are complex data, exploration and visualization of the data set are simplified by ADA. We have seen the application in a real problem concerning footwear. Furthermore, our proposal is scalable.

If the data set contains different groups and we want to find the archetypal curves of each group, the same idea as in [20] could be considered: using a clustering algorithm to find the groups and then applying ADA within each group.

In future work, ADA could be replaced in our methodology by robust ADA [78] when dealing with outliers. Furthermore, our methodology could be extended to irregular or sparsely sampled curves [79]. A new line of research could involve using the distances and ADA in a different data science problem, such as the detection of outlier curves by extending the idea proposed by [80]. Furthermore, the fields of application are numerous, from medicine [81] to industry [82] or computer animation [83].