1 Introduction

Huge parts of statistical theory, especially on its nonparametric side, rely heavily on the notion of ranks; see for instance Gibbons and Chakraborti (2010). However, ranks are not well defined in a multivariate framework, as there exists no natural ordering in more than one dimension. This fact motivated Tukey (1975) to introduce the notion of statistical depth as a surrogate for ‘multivariate ranks’. Concretely, a depth is a measure of how central (or how outlying) a given point is with respect to a multivariate probability distribution. Zuo and Serfling (2000), following some earlier considerations in Liu (1990), formulated the properties that a valid depth measure should satisfy. Since then, depth-based procedures have proved to be very important tools for robust multivariate statistical analyses; see, e.g., Liu et al. (1999), Li and Liu (2004, 2008) or Zuo (2021). Serfling (2006) and Mosler (2013) offer excellent short reviews of the ideas surrounding the concept of depth, while Hallin et al. (2021) recently shed new light on the problem of ‘multivariate ranks’.

The early 21st century has also seen such technological progress in recording devices and memory capacity, that any spatio-temporal phenomenon can now be recorded essentially in continuous time or space, giving rise to ‘functional’ random objects. As a result, a solid theory for Functional Data Analysis (FDA) has been developed as well, allowing the extension of most of the classical problems of statistical inference from the multivariate context to the inherently infinite-dimensional functional case. In particular, functional versions of statistical depth have been investigated (Fraiman and Muniz 2001; Cuevas et al. 2007; López-Pintado and Romo 2009; Dutta et al. 2011; López-Pintado and Romo 2011; Sguera et al. 2014; Chakraborty and Chaudhuri 2014; Hlubinka et al. 2015; Nieto-Reyes and Battey 2021; Nieto-Reyes et al. 2021). It is worth noting that an infinite-dimensional environment implies specific theoretical and practical challenges, making the extension from ‘multivariate’ to ‘functional’ a non-trivial one (Nieto-Reyes and Battey 2016).

In this paper, we carry on with this gradual extension process by defining a statistical depth for complex random objects living in abstract metric spaces. Again, this extension is motivated by the rapid development of technology. Indeed, this is the ‘Big Data’ era, in which digital data are recorded everywhere, all the time. The information that this huge amount of data contains may enable next-generation scientific breakthroughs, drive business forward or hold governments accountable. However, this is conditional on the existence of a statistical toolbox suitable for such Big Data, whose profusion and nature induce commensurate challenges. Indeed, those data consist of objects as varied as high-dimensional/infinite-dimensional vectors, matrices or functions representing images, shapes, movies, texts, handwriting or speech (to cite a few), and live streaming series thereof—this is often summarised as ‘3V’ (Volume, Variety and Velocity).

Mainstream statistical techniques often fall short when analysing such complex mathematical objects. Yet, it remains true that any statistical analysis requires a sense of how close two instances of the object of interest are to one another. It is then only natural to assume that they live in a space where distances can be defined—that is, in a certain metric space (Snášel et al. 2017). This motivates the need for a statistical depth defined on an abstract metric space, as recently acknowledged by Dai and Lopez-Pintado (2022), who extended Tukey’s halfspace depth to such a general setting. Our proposal of a ‘metric depth’ continues that line of research.

The idea that the concept of multivariate statistical depth could be extended to general non-Euclidean settings can be traced back to Carrizosa (1996, section 3.1). Later, Li et al. (2011) considered a depth-based procedure for analysing abundance data, which are typically high-dimensional discrete data with many observed 0’s. Because of that particular structure, the classical Euclidean distance is not optimal for quantifying (dis)similarities between observations, and analysts in the field usually prefer more specific metrics such as the Bray-Curtis distance (Bray and Curtis 1957). In consequence, inspired by earlier works by Maa et al. (1996) and Bartoszynski et al. (1997), Li et al. (2011) devised a depth measure which allows the proximity between observations to be quantified by a specific, user-chosen distance/dissimilarity measure.

This flexibility appears even more desirable when dealing with the polymorphous objects commonly found in modern data sets, as described above. For instance, functional objects are much richer than just infinite-dimensional vectors, and they can be compared on many different grounds: general appearance, short- or long-range variation, oscillating behaviour, etc.; which makes the choice of the ‘proximity measure’ between two such objects a very crucial one (Ferraty and Vieu 2006, Chapter 3). On a more theoretical basis, an appropriate choice of such ‘proximity measure’ sometimes allows one to get around issues caused by the ‘Curse of Dimensionality’ (Geenens 2011a).

Quantifying (dis)similarities between non-numeric objects is even more subject to discretionary choice. As an example, for comparing pieces of text, the literature in text mining, linguistics and natural language processing has proposed numerous metrics such as the Levenshtein distance, the Hamming distance, the Jaccard index or the Dice coefficient—each targeting different dimensions of words, sentences or texts, such as similarity in spelling or similarity in meaning (Wang and Dong 2020). It is, therefore, paramount to have access to statistical procedures which allow a free choice of metric, and can be tailored to the kind of data at hand and to the ultimate purpose of the analysis.

Indeed, our proposed ‘metric depth’ (\(\mu D\)), defined in Sect. 2, enables such flexible analyses. Its main properties are explored in Sect. 3 and an empirical version (computable from a sample) is described in Sect. 4. Section 5 illustrates its capabilities on several real data sets, including an application in ‘text mining’ (Sect. 5.5). Section 6 concludes.

2 Statistical depth in metric spaces: definition

Assume that the random object of interest, say \({\mathcal {X}}\), lives in a certain space \({\mathcal {M}}\) which can be equipped with a distance d. To avoid dispensable technical complications, it will be assumed throughout that \(({\mathcal {M}},d)\) is a complete and separable metric space. Let \({\mathcal {A}}\) be the \(\sigma \)-algebra on \({\mathcal {M}}\) generated by the open d-metric balls and \({\mathcal {P}}\) be the space of all probability measures defined on \({\mathcal {A}}\). This makes \(({\mathcal {M}},{\mathcal {A}},P)\) a proper probability space for any \(P \in {\mathcal {P}}\). In particular, it will be assumed that the distribution of \({\mathcal {X}}\) belongs to \({\mathcal {P}}\). Note that the Cartesian product space \(({\mathcal {M}}\times {\mathcal {M}}, {\mathcal {A}}\times {\mathcal {A}}, P \times P)\) is then also a valid probability space (Parthasarathy 1967, Theorem I.1.10). We denote:

$$\begin{aligned} {{\mathbb {P}}}({\mathcal {S}}({\mathcal {X}}_1,{\mathcal {X}}_2)) \doteq \iint _{{\mathcal {M}}\times {\mathcal {M}}} {\mathcal {S}}(\chi _1,\chi _2)\, \textrm{d}(P \times P)(\chi _1,\chi _2)\end{aligned}$$

for any measurable statement \({\mathcal {S}}: {\mathcal {M}}\times {\mathcal {M}}\rightarrow \{0,1\}\)—the statement returns the value 1 if it is true, and 0 otherwise. So, \({{\mathbb {P}}}({\mathcal {S}}({\mathcal {X}}_1,{\mathcal {X}}_2))\) returns the probability that \({\mathcal {S}}\) is true if \({\mathcal {X}}_1, {\mathcal {X}}_2\) are two independent replications of \({\mathcal {X}}\), whose distribution is P.

Then we give the following definition:

Definition 2.1

The ‘metric depth’ (‘\(\mu D\)’) of the point \(\chi \) in the metric space \(({\mathcal {M}},d)\) with respect to the probability measure \(P \in {\mathcal {P}}\) is defined as:

$$\begin{aligned} \mu D(\chi ,P) = {{\mathbb {P}}}\big ( d({\mathcal {X}}_1,{\mathcal {X}}_2) > \max \left\{ d({\mathcal {X}}_1,\chi ),d({\mathcal {X}}_2,\chi ) \right\} \big ). \end{aligned}$$
(2.1)

For each fixed \(\chi \in {\mathcal {M}}\), the set \(\big \{(\chi _1,\chi _2)\in {\mathcal {M}}\times {\mathcal {M}}: d(\chi _1,\chi _2) > \max \{d(\chi _1,\chi ),d(\chi _2,\chi )\} \big \}\) belongs to the \(\sigma \)-algebra \({\mathcal {A}}\times {\mathcal {A}},\) with \({\mathcal {A}}\) defined above, making the probability statement \({{\mathbb {P}}}\) in (2.1) a well-defined one for any \(P \in {\mathcal {P}}\).

The interpretation of (2.1) in terms of depth is clear: a point \(\chi \in {\mathcal {M}}\) is deep with respect to the distribution P if it is likely to be found ‘between’ two objects \({\mathcal {X}}_1\) and \({\mathcal {X}}_2\) in \({\mathcal {M}}\) randomly generated from P. ‘Between’ here means that the side joining \({\mathcal {X}}_1\) and \({\mathcal {X}}_2\) is the longest in the ‘triangle’ of \({\mathcal {M}}\) with vertices \({\mathcal {X}}_1\), \({\mathcal {X}}_2\) and \(\chi \); or, in other words, that \(\chi \) belongs to the intersection of the two open d-balls \(B_d({\mathcal {X}}_1,d({\mathcal {X}}_1,{\mathcal {X}}_2))\) and \(B_d({\mathcal {X}}_2,d({\mathcal {X}}_1,{\mathcal {X}}_2))\), where \(B_d({\mathcal {X}}_1,d({\mathcal {X}}_1,{\mathcal {X}}_2))\) is the ball with centre \({\mathcal {X}}_1\) and radius \(d({\mathcal {X}}_1,{\mathcal {X}}_2)\). In this sense, (2.1) is an extension of the vectorial ‘lens depth’ (Liu and Modarres 2011), as the two coincide when \(({\mathcal {M}},d) = ({\mathbb {R}}^p,d_2)\), where \(d_2\) is the Euclidean distance over \({\mathbb {R}}^p\), \(p\ge 1\). If we define

$$\begin{aligned} L_d({\mathcal {X}}_1,{\mathcal {X}}_2):= B_d({\mathcal {X}}_1,d({\mathcal {X}}_1,{\mathcal {X}}_2)) \cap B_d({\mathcal {X}}_2,d({\mathcal {X}}_1,{\mathcal {X}}_2)), \end{aligned}$$
(2.2)

the ‘lens’ defined by \({\mathcal {X}}_1\) and \({\mathcal {X}}_2\) in \(({\mathcal {M}},d)\), then \(\mu D(\chi ,P) = {{\mathbb {P}}}( L_d({\mathcal {X}}_1,{\mathcal {X}}_2) \ni \chi )\). This is the probability that a random set contains a certain element \(\chi \), and interesting parallels can be drawn with the theory of random sets, in particular Choquet capacities and related ideas (Molchanov 2005, Chapter 1). Note that, independently of this work, Cholaquidis et al. (2022) recently explored the extension of the ‘lens depth’ to general metric spaces as well, while Cholaquidis et al. (2021) studied the associated level sets. Their focus and the content of their papers are, however, quite different from what is investigated here.

Finally, it is interesting to verify that, in the particular case where \({\mathcal {M}}= {\mathbb {R}}\) and d is the usual distance \(d(\chi ,\xi ) = |\chi -\xi |\), the metric depth \(\mu D\) exactly coincides with the simplicial depth \(D_S\) of Liu (1990) (defined using open simplices):

$$\begin{aligned} \mu D(\chi ,P)&= {\mathbb {P}}( |{\mathcal {X}}_{1}-{\mathcal {X}}_{2}| > \max (|{\mathcal {X}}_{1}-\chi |,|{\mathcal {X}}_{2}-\chi |)) \\&= {\mathbb {P}}(\chi \in (\min ({\mathcal {X}}_{1},{\mathcal {X}}_{2}),\max ({\mathcal {X}}_{1},{\mathcal {X}}_{2}))) = D_{S}(\chi ,P). \end{aligned}$$

Note that the same occurs for the ‘lens depth’—in \({\mathbb {R}}\), where the notion of rank is unequivocally defined, all reasonable depth measures are indeed expected to agree.
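To fix ideas, here is a minimal numerical sketch (in Python; the function names and the Monte Carlo scheme are ours, not part of the original development) of this agreement on \({\mathbb {R}}\): for a continuous distribution function F, the simplicial depth is \(D_S(x,P) = 2F(x)(1-F(x))\), which a Monte Carlo approximation of (2.1) should reproduce up to simulation error.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def mu_depth_mc(x, draws, n_pairs=200_000):
    """Monte Carlo approximation of muD(x, P) on the real line (d = |.|),
    using independent pairs (X1, X2) resampled from `draws`."""
    X1 = rng.choice(draws, size=n_pairs)
    X2 = rng.choice(draws, size=n_pairs)
    return np.mean(np.abs(X1 - X2) > np.maximum(np.abs(X1 - x), np.abs(X2 - x)))

draws = rng.normal(size=100_000)                     # P = N(0, 1)
for x in (0.0, 1.0, 2.0):
    # The two columns should agree up to Monte Carlo error.
    print(x, mu_depth_mc(x, draws), 2 * norm.cdf(x) * (1 - norm.cdf(x)))
```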

3 Main properties

The fact that the distance d is left free makes the metric depth \(\mu D\) a very flexible tool: any meaningful d equipping \({\mathcal {M}}\) can be used in (2.1) without altering the theoretical properties explored below.

In addition, we note that nowhere in the developments do we explicitly use the fact that \(d(\chi ,\xi ) = 0 \iff \chi = \xi \) for any two \(\chi ,\xi \in {\mathcal {M}}\) (identity of indiscernibles). A proximity measure which satisfies all the other properties of a distance (non-negativity, symmetry and triangle inequality) but not ‘identity of indiscernibles’ is called a pseudo-distance. Hence, the metric depth (2.1) can be used in conjunction with a pseudo-distance while keeping its essential features. We can, for instance, assess the proximity between two objects by comparing the coefficients of their leading terms when expanded in certain bases, such as a spline basis in the case of functional data when smoothing the original data is necessary (Ramsay and Silverman 2005, Chapter 3); a minimal sketch is given below. Other examples are given in Sect. 5.
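As a simple illustration, here is a sketch (in Python; the names and the truncation level are ours) of a pseudo-distance comparing two functions only through the first k coefficients of their expansions in a common basis:

```python
import numpy as np

def coef_pseudo_distance(coef_chi, coef_xi, k=10):
    """Pseudo-distance comparing two functions only through the first k
    coefficients of their expansions in a common basis (e.g. B-splines
    fitted to the raw curves). Non-negativity, symmetry and the triangle
    inequality are inherited from the Euclidean norm, but two distinct
    functions sharing their first k coefficients get distance 0, so
    'identity of indiscernibles' fails -- hence a pseudo-distance."""
    c1, c2 = np.asarray(coef_chi)[:k], np.asarray(coef_xi)[:k]
    return float(np.linalg.norm(c1 - c2))
```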

3.1 Elasticity invariance

\((P_1)\) :

Let \(\varphi : {\mathcal {M}}\rightarrow {\mathcal {M}}\) be an ‘elastic’ map in the sense that for any \(\chi ,\xi , \chi ',\xi ' \in {\mathcal {M}}\), \(d(\chi ,\xi )< d(\chi ',\xi ') \iff d(\varphi (\chi ),\varphi (\xi )) < d(\varphi (\chi '),\varphi (\xi '))\). Then, \(\mu D(\varphi (\chi ),P_{\varphi }) = \mu D(\chi ,P),\) where \(P_\varphi \) is the push-forward distribution of the image through \(\varphi \) of a random object of \({\mathcal {M}}\) having distribution P.

This follows from the fact that \(d(\varphi ({\mathcal {X}}_1),\varphi ({\mathcal {X}}_2)) > \max \left\{ d(\varphi ({\mathcal {X}}_1),\varphi (\chi )),d(\varphi ({\mathcal {X}}_2),\varphi (\chi )) \right\} \) \(\iff d({\mathcal {X}}_1,{\mathcal {X}}_2) > \max \left\{ d({\mathcal {X}}_1,\chi ),d({\mathcal {X}}_2,\chi ) \right\} \) for such a map \(\varphi \). Such maps obviously include any isometry, for which \(d(\chi ,\xi ) = d(\varphi (\chi ),\varphi (\xi ))\), and other dilation-type transformations for which \(d(\chi ,\xi ) = a_\varphi d(\varphi (\chi ),\varphi (\xi ))\) for some positive scalar constant \(a_\varphi \), but not only those. Clearly, \((P_1)\) establishes \(\mu D\) as a purely topological concept. On another note, (\(P_1\)) may be thought of as an extension of property P1 in Zuo and Serfling (2000, p. 463)—that a depth measure in \({\mathbb {R}}^d\) ‘should not depend on the underlying coordinate system or, in particular, on the scales of the underlying measurements’.

3.2 Vanishing at infinity

Assume that \(({\mathcal {M}},d)\) is an unbounded metric space, i.e., \(\sup _{\chi ,\xi \in {\mathcal {M}}} d(\chi ,\xi ) = \infty \). Then:

\((P_2)\):

For any \(P \in {\mathcal {P}}\) and \(\chi \in {\mathcal {M}}\), \(\lim _{R \rightarrow \infty } \sup _{\xi \notin B_d(\chi ,R) }\mu D(\xi ,P)= 0\).

This follows from Proposition 1(a) in Cholaquidis et al. (2022). It is obviously the analogue to Zuo and Serfling (2000)’s P4: ‘The depth of a point x should approach 0 as \(\Vert x\Vert \) approaches infinity’.

Now suppose that, \(\forall \chi \in {\mathcal {M}}\),

$$\begin{aligned} {{\mathbb {P}}}\big ( d({\mathcal {X}}_1,{\mathcal {X}}_2) = \max \left\{ d({\mathcal {X}}_1,\chi ),d({\mathcal {X}}_2,\chi ) \right\} \big ) =0. \end{aligned}$$
(3.1)

This kind of continuity condition guarantees that, with probability 1, a given \(\chi \in {\mathcal {M}}\) will not lie exactly on the boundary of a random lens such as (2.2)—in fact, for \({\mathcal {M}}= {\mathbb {R}}\), this condition exactly amounts to the continuity of the distribution P. Then, we can prove the following properties \((P_3)\) and \((P_4)\).

3.3 Continuity in \(\chi \)

\((P_3)\) :

For any \(P \in {\mathcal {P}}\) such that (3.1) holds, \(\forall \chi \in {\mathcal {M}}\) and \(\forall \epsilon >0\), there exists \(\delta >0\) such that

$$\begin{aligned} \sup _{\xi : d(\chi ,\xi )<\delta } |\mu D(\xi ,P) - \mu D(\chi ,P)| < \epsilon . \end{aligned}$$

Indeed, for any \(\chi \in {\mathcal {M}}\), take \(\xi \in {\mathcal {M}}\) such that \(d(\chi ,\xi ) < \delta \), for some \(\delta >0\). Then, by the triangle inequality, for any \({\mathcal {X}}_1,{\mathcal {X}}_2 \in {\mathcal {M}}\),

$$\begin{aligned} \max \left\{ d({\mathcal {X}}_1,\chi ),d({\mathcal {X}}_2,\chi ) \right\} - \delta&< \max \left\{ d({\mathcal {X}}_1,\xi ),d({\mathcal {X}}_2,\xi ) \right\} \\&< \max \left\{ d({\mathcal {X}}_1,\chi ),d({\mathcal {X}}_2,\chi ) \right\} + \delta . \end{aligned}$$

Hence, \(\mu D(\xi ,P) = {{\mathbb {P}}}\big ( d({\mathcal {X}}_1,{\mathcal {X}}_2) > \max \left\{ d({\mathcal {X}}_1,\xi ), d({\mathcal {X}}_2,\xi ) \right\} \big )\) is such that

$$\begin{aligned} {{\mathbb {P}}}\big ( d({\mathcal {X}}_1,{\mathcal {X}}_2)> \max \left\{ d({\mathcal {X}}_1,\chi ),d({\mathcal {X}}_2,\chi ) \right\} + \delta \big ) \\ \le \mu D(\xi ,P) \le {{\mathbb {P}}}\big ( d({\mathcal {X}}_1,{\mathcal {X}}_2) > \max \left\{ d({\mathcal {X}}_1,\chi ),d({\mathcal {X}}_2,\chi ) \right\} -\delta \big ). \end{aligned}$$

Now, see that \(\Psi (x) \doteq {{\mathbb {P}}}\left( \max \left\{ d({\mathcal {X}}_1,\chi ),d({\mathcal {X}}_2,\chi ) \right\} - d({\mathcal {X}}_1,{\mathcal {X}}_2) \le x \right) \) is a cumulative distribution function, continuous at \(x=0\) by (3.1). This means that, for any \(\epsilon >0\), we can find a \(\delta >0\) such that \(|\Psi (\pm \delta )-\Psi (0)|< \epsilon \). As \(\Psi (0) = \mu D(\chi ,P)\) under (3.1), the claim follows.

A similar result appears in Cholaquidis et al. (2022, Proposition 2 in Supplementary Material) under a condition slightly stronger than (3.1).

3.4 Continuity in P

\((P_4)\) :

For any \(P \in {\mathcal {P}}\) such that (3.1) holds, \(\forall \chi \in {\mathcal {M}}\) and \(\forall \epsilon >0\), there exists \(\delta >0\) such that \(|\mu D(\chi ,Q) - \mu D(\chi ,P)| < \epsilon \) for all \(Q \in {\mathcal {P}}\) with \(d_{\mathcal {P}}(P,Q) <\delta \), where \(d_{\mathcal {P}}\) is any distance metricising the topology of weak convergence on \({\mathcal {P}}\).

This follows directly from classical results on convergence of probability measures on separable metric spaces—e.g. Dudley (2002, Theorem 11.1.1)—as \(\mu D(\chi ,P)\) and \(\mu D(\chi ,Q)\) in (2.1) are simple probability statements on elements of \({\mathcal {A}}\times {\mathcal {A}}\). Note that (3.1) guarantees that the ‘lens’ (2.2) is a continuity set in the sense of Dudley (2002, section 11.1).

3.5 Further comments

Zuo and Serfling (2000) listed two more desirable properties for a depth measure on \({\mathbb {R}}^d\): ‘Maximality at centre’ and ‘Monotonicity relative to deepest point’ (their properties P2 and P3). Similar features are difficult to investigate for \(\mu D\) without endowing \(({\mathcal {M}},d)\) with a stronger structure, such as some form of convexity, or without requiring d to satisfy a parallelogram law, for example. As an illustration, Zuo and Serfling (2000)’s P2 ‘Maximality at centre’ requires the depth to be maximal at a uniquely defined ‘centre’ with respect to some notion of symmetry. Without assuming a stronger structure on \(({\mathcal {M}},d)\), even the very definition of symmetry in \({\mathcal {M}}\) is unclear. As our aim is to keep the proposed metric depth as flexible as possible, we do not investigate further in that direction. Those properties of \(\mu D\) may (or may not) be established in specific applications where \({\mathcal {M}}\) and d are precisely defined, though.

On a side note, although Liu and Modarres (2011) supposedly showed (their Theorem 6) that their Euclidean ‘lens depth’ in \({\mathbb {R}}^d\)—of which (2.1) can be thought of as an extension—satisfies ‘Maximality at centre’ for centrally symmetric distributions, their proof appears wrong, as pointed out in Kleindessner and von Luxburg (2017). Yet, Kleindessner and von Luxburg (2017) conceded that they believed the statement to be true. In the Appendix, we give three counter-examples establishing that the statement is actually not true: Liu and Modarres (2011)’s ‘lens depth’ generally satisfies neither ‘Maximality at centre’ nor ‘Monotonicity relative to deepest point’ (Zuo and Serfling 2000’s P2 and P3) for centrally symmetric distributions on \({\mathbb {R}}^d\).

A last important point is the following. Suppose that the balls \(B_d(\cdot ,\cdot )\) are convex in \(({\mathcal {M}},d)\). Then it can easily be checked that, for any non-degenerate distribution \(P \in {\mathcal {P}}\) (i.e., not a unit point mass at some \(\chi \in {\mathcal {M}}\)), \(\mu D\) cannot be degenerate in the sense that \(\mu D(\chi ,P) \equiv 0\) for all \(\chi \in {\mathcal {M}}\). Indeed, by convexity, the intersection \(B_d({\mathcal {X}}_1,d({\mathcal {X}}_1,{\mathcal {X}}_2)) \cap B_d({\mathcal {X}}_2,d({\mathcal {X}}_1,{\mathcal {X}}_2))\) is non-empty as soon as \({\mathcal {X}}_1 \ne {\mathcal {X}}_2\), so there always exists some \(\chi \in {\mathcal {M}}\) which gets a positive depth by (2.1). It is known that some instances of statistical depth admit such degenerate behaviour; for instance, this is the case for López-Pintado and Romo (2009, 2011)’s band and half-region depths for a wide class of distributions on common functional spaces (Chakraborty and Chaudhuri 2014, Theorems 3 and 4).

4 Empirical metric depth

4.1 Definition and main statistical properties

Assume now that we have a random sample \(\{\chi _i; i = 1,\ldots ,n \}\) drawn from the unknown distribution P on \({\mathcal {M}}\). Then the depth of some point \(\chi \in {\mathcal {M}}\) with respect to P must actually be estimated. The empirical analogue of (2.1) is naturally \(\mu D(\chi ,{\widehat{P}}_n)\), where \({\widehat{P}}_n\) is the empirical measure associated to \(\{\chi _i; i = 1,\ldots ,n \}\), i.e., the collection of 1/n-weighted point masses at \(\chi _1,\ldots ,\chi _n\). This yields

$$\begin{aligned} \mu D(\chi ,{\widehat{P}}_n) = \binom{n}{2}^{-1} \sum _{1\le i<j\le n} {\mathbb {1}}\big \{ d(\chi _i,\chi _j) > \max \left\{ d(\chi _i,\chi ),d(\chi _j,\chi ) \right\} \big \}. \end{aligned}$$
(4.1)

Obviously, \({\widehat{P}}_n \overset{P{\text {-a.s.}}}{\longrightarrow } P\), which guarantees under (3.1) the strong pointwise consistency of the estimator \({\mu }D(\chi ,{\widehat{P}}_n)\), that is

$$\begin{aligned} \mu D(\chi ,{\widehat{P}}_n) \overset{P{\text {-a.s.}}}{\longrightarrow } \mu D(\chi ,P), \end{aligned}$$
(4.2)

for all \(\chi \in {\mathcal {M}}\). This easily follows from Property \((P_4)\). Under stronger assumptions, Cholaquidis et al. (2021, Theorem 4) established the uniform strong consistency of \(\mu D\) on compact subsets of \({\mathcal {M}}\); viz.

$$\begin{aligned} \sup _{\chi \in \Phi } |\mu D(\chi ,{\widehat{P}}_n) - \mu D(\chi ,P)| \overset{\text {a.s.}}{\longrightarrow } 0, \qquad n \rightarrow \infty , \end{aligned}$$
(4.3)

for any compact set \(\Phi \subset {\mathcal {M}}\).

Finally, the obvious U-statistic structure of (4.1) allows us to easily deduce, through an appropriate Central Limit Theorem, the asymptotic normality of \(\mu D(\chi ,{\widehat{P}}_n)\), as a straightforward consequence of (Arcones and Giné 1993, Theorem 4.10) and (Giné 1996, Proposition 10); see also (Cholaquidis et al. 2021, Theorem 6). This result can be used for inference, for instance to build a confidence region for the ‘true’ median element, i.e. the deepest element with respect to the population distribution P (Serfling and Wijesuriya 2017). Note that this median need not be unique, and careful treatment is necessary in that case. Indeed, as discussed above, the depth is not guaranteed to decrease monotonically when moving away from the deepest point, so the set of deepest points is not necessarily convex. In such cases, we may make use of the level sets of maximum depth, following results of Cholaquidis et al. (2021).

4.2 Computational aspects

The practical computation of the empirical metric depth (4.1) involves obtaining all pairwise distances between the elements of the set \(\{\chi \} \cup \{\chi _i; i = 1,\ldots ,n \}\) and checking whether \(d(\chi _i,\chi _j) > \max \{d(\chi _i,\chi ),d(\chi _j,\chi )\}\) for \(1 \le i < j \le n\). A simple (if not naive) implementation thus runs in at most \(O(n^{2})\) operations. The computational cost of the metric depth is therefore generally lower than that of the simplicial depth, which requires \(O(n^{p+1})\) operations in \({\mathbb {R}}^{p}\), or of the Tukey depth, which involves infinitely many one-dimensional projections. The above observation assumes that the dimension of the objects of interest is fixed; e.g., functional data are represented by vectors of a certain length which does not vary with n. Even so, the objects of interest may be complex and have dimension close to or larger than the sample size n—this may have to be taken into account when computing the distance d. For example, an \(L_{r}\)-distance, \(1 \le r \le \infty \), between functional data represented by vectors of size N is computed in O(N) operations; if N is comparable to n, the total computational cost is effectively \(O(Nn^{2})\).
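As an illustration, here is a minimal sketch (in Python; the function names are ours) of this naive \(O(n^2)\) computation, normalised over all \(\binom{n}{2}\) pairs as in (4.1):

```python
import numpy as np

def empirical_metric_depth(chi, sample, dist):
    """Empirical metric depth (4.1) of chi: the proportion, over all pairs
    i < j, of 'lenses' L_d(chi_i, chi_j) that contain chi."""
    n = len(sample)
    d_to_chi = [dist(s, chi) for s in sample]        # n distance evaluations
    count = 0
    for i in range(n):                               # O(n^2) pair checks
        for j in range(i + 1, n):
            if dist(sample[i], sample[j]) > max(d_to_chi[i], d_to_chi[j]):
                count += 1
    return count / (n * (n - 1) / 2)

# Example with the Euclidean distance on R^3:
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
print(empirical_metric_depth(np.zeros(3), X, lambda a, b: float(np.linalg.norm(a - b))))
```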

5 Data examples

In this section, we illustrate the usefulness of the proposed metric depth \(\mu D\) on five real data sets: two univariate functional data sets (Sects. 5.1 and 5.2), a bivariate functional data set (Sect. 5.3), a symbolic data set (Sect. 5.4) and a non-numeric (text) data set (Sect. 5.5).

Fig. 1 Canadian weather data—average daily temperatures at 35 stations

Fig. 2 Five deepest curves (left; the darkest curves are the deepest) and five least deep curves (right; the lightest curves are the least deep) according to (4.1) with \(d(\chi ,\xi ) = \Vert \chi -\xi \Vert _2\), the \(L_2\) distance

5.1 Canadian weather data

The Canadian temperature data set is a classical functional data set available from the R package fda. The data give the daily temperature records of 35 Canadian weather stations over a year (365 days, day 1 being 1 January), averaged over 1960 to 1994; see Fig. 1. First, the depth of the 35 curves with respect to the sample was computed from the empirical functional metric depth (4.1) with d being the usual \(L_2\) distance between two square-integrable functions, i.e. \(d^2(\chi ,\xi ) = \int \left( \chi (t)-\xi (t) \right) ^2\,dt\). The 5 deepest and least deep curves are shown in Fig. 2. The suggested depth measure identifies the Sherbrooke (deepest curve), Thunder Bay, Fredericton, Quebec and Calgary stations as the most representative of a median Canadian weather, in terms of temperature. On the other hand, the most outlying curves are seen to be Resolute (least deep curve), Victoria, Vancouver, Inuvik and Iqaluit. It is visually obvious that those curves are very different from the others: Resolute, Inuvik and Iqaluit are Arctic stations, with much colder temperatures across the year than the other stations, while Vancouver and Victoria lie on the south Pacific coast of Canada and enjoy much milder winters. We can appreciate that Vancouver and Victoria are ‘shape outliers’, whereas the Arctic stations are ‘location outliers’. Both types are equally easily flagged by the metric depth \(\mu D\)—this has to be stressed, as some functional depths have been shown to be able to identify one type of outlier but not the other, or vice-versa (Serfling and Wijesuriya 2017).

Of course, the daily average temperature curves are particularly noisy, which could heavily affect the \(L_2\)-distances computed between pairs of curves, and hence the whole calculation of the depths. One can deal with the roughness of those curves in different manners: first, one could use smoothed versions of the initial curves, for instance the monthly average temperatures as in Serfling and Wijesuriya (2017); second, one could use for d a distance less affected by such noise than the \(L_2\) one, for instance the supremum (\(L_\infty \)) distance; finally, one can expand the different curves in a certain basis and focus only on the first terms when assessing the proximity between them. We achieved the latter by expanding each curve in the empirical principal components basis (Hall 2011) and keeping only the first two principal scores: the curves re-constructed from those two components only are indeed smooth approximations of the initial, rough curves. So, each curve is now represented by a point in the 2-dimensional space of the first two principal components, and the proximity between two curves is quantified by the \(L_2\)-distance between the corresponding two points. In effect, this defines a pseudo-distance between the initial curves (see Ferraty and Vieu 2006, section 3.4.1). The depths assigned to each station according to these 4 methods are shown in Table 1. The four depth measures are in very good agreement, essentially identifying the same central and outlying curves. This shows that the depth measure \(\mu D\) (2.1) and its empirical version (4.1) are quite robust to any reasonable choice of d.
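For the last option, a minimal sketch (in Python; the 35 × 365 layout of the temperature array and all names are assumptions on our part) of the PCA-based pseudo-distance:

```python
import numpy as np

def pc_scores(curves, n_comp=2):
    """Scores of each discretised curve on the first n_comp empirical
    principal components (rows of `curves` are stations, columns are days)."""
    centred = curves - curves.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ Vt[:n_comp].T                # (n_curves, n_comp) scores

# Pseudo-distance between two stations in the plane of the first two PCs;
# feeding it into (4.1) would give a column like muD_PCA in Table 1.
# scores = pc_scores(temperatures)              # temperatures: 35 x 365 array
# d_pca = lambda i, j: float(np.linalg.norm(scores[i] - scores[j]))
```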

Table 1 Canadian weather data—metric depth measures for 4 different (pseudo-)distances d: \(\mu D_2\): \(L_2\) distance; \(\mu D_\infty \): Supremum (\(L_\infty \)) distance; \(\mu D_2^\text {m}\): \(L_2\) distance on the average monthly temperature curves; \(\mu D_\text {PCA}\): \(L_2\) distance in the plane of the first two principal components

5.2 Lip movement data

Malfait and Ramsay (2003) studied the relationship between lip movement and time of activation of different face muscles; see also Ramsay and Silverman (2002, Chapter 10) and Gervini (2008). The study involved a subject saying the word ‘bob’ 32 times, and the movement of their lower lip was recorded each time. Those trajectories are shown in Fig. 3, and all share the same pattern: a first peak corresponding to the first /b/, then a plateau corresponding to the /o/ and finally a second peak for the second /b/. These functions being very smooth (actually, they are smoothed versions of raw data which are not publicly available), it seems natural to use again the classical \(L_2\) distance for assessing their relative proximity. Hence, the respective depth of each curve with respect to the sample was obtained by (4.1) with \(d^2(\chi ,\xi ) = \int \left( \chi (t)-\xi (t) \right) ^2\,dt\). The 5 deepest and 5 least deep curves are shown in the top row of Fig. 4. In particular, this depth identifies as outliers the three curves showing a second peak at a much later time than the rest of the curves, which had already been singled out by Gervini (2008). The remaining two outlying curves show two peaks of lower amplitude than the others, with a second peak occurring earlier than for the rest of the sample.

Fig. 3 Lip movement data

Fig. 4 Top and middle row: Five deepest curves (left; the darkest curves are the deepest) and five least deep curves (right; the lightest curves are the least deep) according to (4.1) with (i) \(d(\chi ,\xi ) = \Vert \chi -\xi \Vert _2\), the \(L_2\) distance between the curves (top row), and (ii) \(d(\chi ,\xi ) = \Vert \chi ''-\xi ''\Vert _2\), the \(L_2\) distance between the second derivatives of the curves (middle row). Bottom row: Five deepest acceleration curves (left; the darkest curves are the deepest) and five least deep acceleration curves (right; the lightest curves are the least deep)

Now, Malfait and Ramsay (2003), in their original study, were more interested in the acceleration of the lip during the process than in the lip motion itself. The study aimed at explaining the time of activation of face muscles, and the acceleration reflects the force applied to the tissue by muscle contraction. Hence, in this application, it may be worth contrasting the lip trajectories in terms of their corresponding accelerations, that is, comparing the second derivatives of the position curves. The \(L_2\) distance between the second derivatives of the curves is naturally a pseudo-distance between the initial curves (Ferraty and Vieu 2006, Section 3.4.3), which can be used in (4.1). The 5 deepest and 5 least deep curves, according to (4.1) based on this ‘acceleration’ pseudo-distance, are shown in the middle row of Fig. 4 and differ from those in the top row. Naturally, the focus is no longer on the exact position of the curves, but rather on the more fundamental underlying dynamics. For instance, the 5 deepest curves show first peaks of distinctly different heights, but in terms of their second derivatives they are in fact quite similar and representative of the sample (bottom row of Fig. 4), and that is what matters in Malfait and Ramsay (2003)’s study. As argued in Sect. 1, the flexibility of \(\mu D\) (2.1) in terms of the choice of d allows the analyst to tailor the depth measure to the data at hand and the goal of the analysis.
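A minimal sketch (in Python; we assume curves sampled on a common equispaced grid of step dt) of such an ‘acceleration’ pseudo-distance:

```python
import numpy as np

def acceleration_pseudo_distance(f, g, dt):
    """L2 pseudo-distance between two discretised curves f and g, computed
    on their second derivatives (approximated by second differences).
    Two curves differing by an affine function of t get distance 0."""
    f_acc = np.diff(f, n=2) / dt**2
    g_acc = np.diff(g, n=2) / dt**2
    return float(np.sqrt(np.sum((f_acc - g_acc) ** 2) * dt))
```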

5.3 Handwriting data

The ‘handwriting’ data set consists of twenty replications of the printing of the three letters ‘fda’ by a single individual. The position of the tip of the pen has been sampled 200 times per second. The data, available in the R package fda, have already been pre-processed so that the printed characters are scaled and oriented appropriately, see Fig. 5.

Fig. 5 Handwriting data

These data are essentially bivariate functional data. Indeed, each instance \(\chi \) of the word ‘fda’ arises through the simultaneous realisation of two components \((\chi _X(t),\chi _Y(t))\), where \(\chi _X(t)\) and \(\chi _Y(t)\) give the position along the horizontal axis and the vertical axis, respectively, of the pen at time t. This is illustrated for one instance of ‘fda’ in Fig. 6. Hence, an appropriate functional metric space here could be \(({\mathcal {M}},d)\) with \({\mathcal {M}}= L_2(T) \times L_2(T)\), \(T=[0,2.3]\) (the time interval on which the position of the pen was recorded) and d being the Euclidean distance on \(L_2(T) \times L_2(T)\) whose square is defined by

$$\begin{aligned} d^2(\chi ,\xi ) = \Vert \chi _X - \xi _X \Vert _2^2 + \Vert \chi _Y - \xi _Y \Vert _2^2 = \int _T \left( \chi _X(t)-\xi _X(t) \right) ^2\,dt + \int _T \left( \chi _Y(t)-\xi _Y(t) \right) ^2\,dt. \end{aligned}$$
(5.1)

This distance can be used directly in (4.1) to identify the 5 deepest and 5 least deep instances of ‘fda’; see Fig. 7. The bivariate nature of the data at hand does not cause any particular complication, and the definition (2.1) need not be re-adapted to this case. Again, the depth so defined focuses only on the ‘drawings’ fda themselves, and identifies the deepest instances. However, it has been argued in the related literature that the tangential acceleration of the pen during the process is also a key element to analyse for understanding the writing dynamics, for instance for discriminating between genuine handwriting and forgeries (Geenens 2011a, b). As in Sect. 5.2, one could therefore use (4.1) with d a pseudo-distance assessing the proximity between two instances of fda through their tangential acceleration curves only, if that were the focus of the analysis.
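A minimal sketch (in Python; the (2, N) storage of each discretised trajectory is our assumption) of the distance (5.1):

```python
import numpy as np

def d_bivariate(chi, xi, dt):
    """Distance (5.1) on L2(T) x L2(T): chi and xi are (2, N) arrays whose
    rows are the discretised x- and y-components of the pen position.
    Summing the squared differences over both rows and multiplying by the
    time step approximates the two integrals in (5.1)."""
    return float(np.sqrt(np.sum((chi - xi) ** 2) * dt))
```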

Fig. 6 One instance of the handwriting data, and its x- and y-components

Fig. 7 Five deepest curves (left; the darkest curves are the deepest) and five least deep curves (right; the lightest curves are the least deep) according to (4.1) with \(d(\chi ,\xi )\) being the \(L_2\) distance (5.1) on \(L_2(T) \times L_2(T)\)

5.4 Age distribution in European countries

Symbolic Data Analysis (SDA) has recently grown into a popular research field in statistics (Billard and Diday 2003, 2007). Indeed, intractably large ‘Big Data’ sets often need to be summarised so that the resulting summary data sets are of manageable size, and so-called ‘symbolic data’ typically arise from such a process. No longer formatted as single values like classical data, they are ‘aggregated’ variables typically represented by lists, intervals, histograms, distributions and the like. In this section we take a closer look at a ‘distribution-valued’ symbolic data set. Specifically, we analyse the distribution of the age of the population in 44 European countries (see Table 2).

The 2017 data were obtained from the US Census Bureau (www.census.gov/population/international/data/). Typically, the population distribution of a given country is presented in the form of a population pyramid (that is, a histogram), from which a proper distribution function for population age can easily be extracted (Kosmelj and Billard 2011). Hence, each country (here the ‘individual’, also called ‘concept’ in the SDA literature) is characterised by a distribution. Figure 8 displays the sample of age distributions. Here we use the suggested metric depth \(\mu D\) to analyse which countries are most representative of the ‘European’ age distribution, and which countries can be regarded as ‘outliers’ in that respect.

Table 2 Age distribution in European countries—metric depth for the age distributions of the 44 European countries, based on the Wasserstein distance
Fig. 8 Age distribution in European countries

The data being here distribution functions of nonnegative variables, \({\mathcal {M}}\) can be identified with a space of distribution functions supported on \({\mathbb {R}}^+\), i.e. a space of nondecreasing càdlàg functions F with \(F(0)=0\) and \(\lim _{t \rightarrow \infty } F(t) =1 \), equipped with an appropriate distance. The Wasserstein distance has proved useful for a wide range of problems explicitly involving distribution functions (Rachev 1984; Panaretos and Zemel 2020), hence seems a natural choice in this setting as well. For some \(r \ge 1\), the Wasserstein distance between two distributions F and G whose rth moments exist is defined as

$$\begin{aligned}d_r(F,G) = \inf _{(X,Y)\sim (F,G)} \{E\left( |X-Y|^r\right) \}^{1/r}, \end{aligned}$$

where the infimum is taken over the set of all joint bivariate distributions whose marginal distributions are F and G respectively. Properties of this distance are described in Major (1978) and Bickel and Freedman (1981). In particular, it is known that \(d_r(F,G)\) is essentially the usual \(L_r\)-distance between the quantile functions \(F^{-1}\) and \(G^{-1}\) over [0, 1]. Also, it is known that convergence in the Wasserstein distance is equivalent to convergence in distribution together with convergence of the first r moments. Hence, the distance \(d_r\) quantifies the proximity between two distributions through both their general appearance and the values of their moments. In what follows, we take \(r=2\), hence we consider functional data in \(({\mathcal {M}}_2,d_2)\), \({\mathcal {M}}_2\) being the space of all probability distribution functions with finite second moment.
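Since \(d_r\) reduces to the \(L_r\)-distance between quantile functions, \(d_2\) is straightforward to approximate numerically; here is a minimal sketch (in Python; the names and the quantile grid are ours), applicable to age samples drawn from the population pyramids:

```python
import numpy as np

def wasserstein_d2(ages_f, ages_g, m=500):
    """Wasserstein distance d_2 between the empirical distributions of two
    samples, via the L2 distance between their quantile functions: the
    mean over a regular grid on (0, 1) approximates the integral of
    (F^{-1}(u) - G^{-1}(u))^2."""
    u = (np.arange(m) + 0.5) / m
    qf = np.quantile(ages_f, u)
    qg = np.quantile(ages_g, u)
    return float(np.sqrt(np.mean((qf - qg) ** 2)))
```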

The flexibility of (2.1) allows us to base \(\mu D\) on the Wasserstein distance, so as to define without any difficulty a depth measure specific to distribution functions. The ‘Wasserstein depths’ of the 44 countries are given in Table 2. The 5 deepest and least deep age distributions are shown in Fig. 9. The deepest distribution, hence the most representative of the age distributions in Europe, appears to be that of Switzerland, a country located at the very heart of Europe, in-between the Western and Eastern countries and in-between the Northern and Southern countries, at the meeting point between the ‘Germanic’ world (Germany, Austria) and the ‘Latin’ world (France, Italy). From that perspective, Switzerland can be regarded as truly representative of a ‘median’ European country in many respects. On the other hand, the Wasserstein-based metric depth is zero for Kosovo and Monaco, and indeed the distributions for those two countries clearly lie outside the bunch of the other distributions. Monaco is a mild-climate micro-state (and, incidentally, a tax haven) which attracts a large number of rich retirees from all over the continent (if not the world); hence its population is overall much older than that of the other countries and its age distribution lies below the others. Monaco set aside, Germany and Italy show the oldest populations in Europe. Kosovo was until recently at the heart of an armed conflict in the Balkans, which explains the low proportion of older people in that country and the position of its age distribution above all the others. To some extent, this also explains the outlyingness of Albania’s curve. In any case, this example illustrates that one can readily define a depth measure tailored to distribution curves, which paves the way for developing rank-like procedures in Symbolic Data Analysis as well.

Fig. 9 Five deepest age distributions (left; the darkest curves are the deepest) and five least deep age distributions (right; the lightest curves are the least deep) according to (4.1) with d being the Wasserstein distance \(d_2\) between distributions

Table 3 Matrix of ‘inter-textual’ distances between 9 essential plays by Thomas Middleton: Phn: ‘The Phoenix’; Mad: ‘A Mad World, My Masters’; Trk: ‘A Trick to Catch the Old One’; Pur: ‘The Puritan’; Alm: ‘The Almanac’; CMC: ‘A Chaste Maid in Cheapside’; Dis: ‘More Dissemblers Besides Women’; Val: ‘The Nice Valour’; WBW: ‘Women Beware Women’
Table 4 37 Shakespeare’s plays (shown in chronological order)—empirical \(\mu D\) in the sample of Shakespeare’s plays only (left column) and in the sample of combined Shakespeare’s and Middleton’s works (right column)
Fig. 10 Top row: Example A.1; Central row: Example A.2; Bottom row: Example A.3. From left to right, density function (first column), sample lens depth constructed with \(n=5,000\) sample draws from P (second column), corresponding heat-map (third column) and its section along the line \(x_{2}=0\) (top-right panel), \(x_{1}=x_{2}\) (central-right panel) and \(x_{1}=0\) (bottom-right panel)

5.5 Authorship attribution by intertextual distance

Author identification for an unknown or doubtful text is one of the oldest statistical problems applied to literature. Here the capability of the proposed metric depth is illustrated within that framework. William Shakespeare and Thomas Middleton were contemporaries (late 16th–early 17th centuries), and their oeuvres are often compared. To that end, Merriam (2003) examined 9 Middleton plays and 37 Shakespeare texts, and computed between each pair of them the ‘inter-textual distance’ proposed by Labbé and Labbé (2001). Although the entities of interest are here purely non-numerical (famous literary pieces), the obtained matrix of distances allows us to outline the relative position of each text—and this is essentially all that is needed for \(\mu D\) to come into play.

As an example, Table 3 (recovered from Appendix 2 in Merriam (2003)) reports the ‘inter-textual’ distances between the 9 essential plays of Middleton. Computing the empirical metric depth (4.1) for each entry of this ‘Middleton sample’ reveals that the two deepest observations are ‘More Dissemblers Besides Women’ and ‘A Trick to Catch the Old One’ (both get a depth of 0.4167). They may therefore be considered the most typical Middleton plays (as long as the ‘inter-textual’ distance is the relevant metric).
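A minimal sketch (in Python; `D` stands for a symmetric matrix of pairwise distances such as Table 3, and the names are ours) of how (4.1) operates directly on such a distance matrix:

```python
import numpy as np

def depth_from_distance_matrix(D, k):
    """Empirical metric depth (4.1) of observation k, computed from a
    precomputed symmetric matrix D of pairwise (inter-textual) distances.
    Pairs involving k itself never satisfy the strict inequality (the
    distance of k to itself is 0), so they simply contribute 0."""
    n = D.shape[0]
    count = sum(D[i, j] > max(D[i, k], D[j, k])
                for i in range(n) for j in range(i + 1, n))
    return count / (n * (n - 1) / 2)

# Depth of every text in the sample:
# depths = [depth_from_distance_matrix(D, k) for k in range(D.shape[0])]
```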

Focusing now on the 37 Shakespeare texts only, ‘Antony and Cleopatra’ is identified as Shakespeare’s most typical text, i.e., the deepest within the considered sample (depth: 0.5255)—see Table 4 (left column). The next most representative Shakespeare plays are ‘The Tempest’ (0.5135), ‘Othello’ (0.5030) and ‘Romeo and Juliet’ (0.5015). The most outlying piece of work is the verse part of ‘Henry V’ (depth: 0), which tends to confirm a conjecture held by many experts on Shakespeare’s oeuvre: that the verse part of ‘Henry V’ was not written by Shakespeare himself, but by Christopher Marlowe (Merriam 2002).

Now, if we compute the metric depth of the 9 Middleton plays in Shakespeare’s sample, all receive depth 0—all are ‘outlying’ within Shakespeare’s oeuvre. This clearly indicates that Middleton’s work cannot be confused with Shakespeare’s, and it should be easy to assign a new piece of text to one or the other based on \(\mu D\). Further, it is interesting to analyse the depth of each text in a combined sample made up of both Middleton’s and Shakespeare’s works. In particular, some of Shakespeare’s texts which have a low depth in the ‘Shakespeare only’ sample see their depth increase substantially in the combined sample. This indicates that those pieces may have, to some extent, a strong Middleton flavour. This hypothesis is confirmed for at least one of those plays: ‘Timon of Athens’ sees its depth increase from 0.1141 to 0.3971 when Middleton’s works are included in the reference sample; and indeed, extensive research on the topic has provided ample evidence that Middleton wrote approximately one third of that play (Taylor 1987).

Note that computing and comparing the depth of certain observations in two different samples is the spirit of the DD-plot and the DD-classifier proposed by Li et al. (2012). These procedures can naturally be used in conjunction with the metric depth \(\mu D\), enabling similar powerful depth-based analyses in abstract metric spaces.

6 Conclusion

In this paper, we have proposed a new statistical depth function, called the ‘metric depth’ or simply \(\mu D\), defined on an abstract metric space. It is explicitly constructed on a distance d that must be chosen by the analyst, which allows them to tailor the depth to the data at hand and to the ultimate goal of the analysis. This offers unmatched flexibility in the range of problems and applications that can be addressed with the proposed depth measure. The usefulness of \(\mu D\) has been illustrated on several real data sets, including one in the emergent field of Symbolic Data Analysis and an application in text mining (authorship attribution). Rejuvenating an old idea of Bartoszynski et al. (1997), its definition is very intuitive: the depth of a point \(\chi \) with respect to a distribution P is the probability of finding it ‘between’ two objects \({\mathcal {X}}_1\) and \({\mathcal {X}}_2\) randomly generated from P, where ‘between’ means that \(\chi \) belongs to the intersection of the two open d-balls \(B_d({\mathcal {X}}_1,d({\mathcal {X}}_1,{\mathcal {X}}_2))\) and \(B_d({\mathcal {X}}_2,d({\mathcal {X}}_1,{\mathcal {X}}_2))\). This definition is natural and enjoys many pleasant properties.