1 Introduction

Point pattern data are abundant in modern scientific studies. From biomedical imagery and geo-referenced disease cases to positions of mobile phone users and climate change-related space–time events such as landslides, increasingly complex data are becoming available. See Chiaraviglio et al. (2016), Lombardo et al. (2018), Konstantinoudis et al. (2019) and Samartsidis et al. (2019) for individual examples and the textbooks Diggle (2013), Baddeley et al. (2015) and Błaszczyszyn et al. (2018) for a broad overview of further applications. While a few decades ago data typically consisted of a single point pattern in a low-dimensional Euclidean space, possibly with some low-dimensional mark information, nowadays we often have multiple observations of point patterns that may live on more complicated spaces, e.g., manifolds (including shape spaces), spaces of convex sets or function spaces. A setting that has received a particularly large amount of attention recently is point patterns on graphs, such as street networks; see Moradi et al. (2018), Moradi and Mateu (2019) and Rakshit et al. (2019), among others.

Fig. 1 An example of barycenters computed by our algorithm for three different data sets. In each panel, there are three data point patterns indicated by different symbols (black). The resulting (pseudo-)barycenter pattern with respect to Euclidean distance is given by the blue circles (\(p=q=2\)). (Color figure online)

Multiple point pattern observations may occur by i.i.d. replication (e.g., of a biological experiment), but may also be governed by one or several covariates or form a time series of possibly dependent patterns. Additional mark information can easily be high-dimensional. Methodology for treating such point pattern data in all these situations is the subject of ongoing statistical research, see, e.g., Baddeley et al. (2015).

From a more abstract point of view, consider the set \(\mathfrak {N}_{\mathrm {fin}}\) of finite counting measures on some metric space \((\mathcal {X},d)\). If we manage to equip \(\mathfrak {N}_{\mathrm {fin}}\) with a metric \(\tau \) that reflects the concept of distance between point patterns in an appropriate problem-related way, there are a number of standard methods which can be applied, including multidimensional scaling, discriminant and cluster analysis techniques. This is a stance already taken in Schuhmacher (2014), Section 1.4, and Mateu et al. (2015). In the metric space \((\mathfrak {N}_{\mathrm {fin}}, \tau )\), we can furthermore define a Fréchet mean of order \(q \ge 1\); that is, for data \(\xi _1,\ldots ,\xi _k \in \mathfrak {N}_{\mathrm {fin}}\) any \(\zeta \in \mathfrak {N}_{\mathrm {fin}}\) minimizing

$$\begin{aligned} \sum _{j=1}^k \tau (\xi _j,\zeta )^q. \end{aligned}$$
(1)

Such a q-th-order mean may serve as a “typical” element of \(\mathfrak {N}_{\mathrm {fin}}\) to represent the data and gives rise to more complex statistical analyses, such as Fréchet regression; see Lin and Müller (2019) and Petersen and Müller (2019).

Two metrics on the space of point patterns that have been widely used are the spike time metric, see Victor and Purpura (1997) for one dimension and Diez et al. (2012) for higher dimension, and the optimal subpattern assignment (OSPA) metric, see Schuhmacher and Xia (2008) and Schuhmacher et al. (2008). In the present paper, we introduce the transport–transform (TT) metric and its normalized version, the relative transport–transform (RTT) metric, which provide a unified framework for the earlier metrics. Both the TT- and the RTT-metrics are based on matching the points between two point patterns on \(\mathcal {X}\) optimally in terms of some power p of d and penalizing points that cannot be reasonably matched. We may interpret these metrics as unbalanced p-th-order Wasserstein metrics, see Remark 3 below. In the present paper, we always set \(p=q\).

Among others, Schoenberg and Tranbarger (2008), Diez et al. (2012) and Mateu et al. (2015) have treated Fréchet means of order 1 (medians) for the spike time metric under the name of prototypes. However, computations in 2d and higher were only possible for very small data sets due to a prohibitive computational cost of \(O(n^6)\) for the distance between two point patterns with n points each. In the present work, we use an adapted auction algorithm that is able to compute TT- and RTT-distances between point patterns in \(O(n^3)\). We further provide a heuristic algorithm that bears some resemblance to a k-means cluster algorithm and is able to compute local minima of the barycenter problem very efficiently. This makes it possible to compute “quasi-barycenters” for 100 patterns of 100 points in \(\mathbb {R}^2\) in a few seconds when basing the TT-distance on the Euclidean distance between points and choosing \(p=q=2\).

In Fig. 1, we show some typical barycenters obtained by our algorithm in this setting. We use smaller data sets for better visibility. In each scenario, there are three different point patterns distinguished by the different symbols in black. The (pseudo-)barycenter represented by the blue circles captures the characteristics of each data set rather well. Some minor irregularities, especially in the third panel, may be due to the fact that only a (good) local optimum is computed.

More important than being fast for point pattern data on \(\mathbb {R}^D\) when using squared Euclidean distances is the fact that our algorithm provides a general plug-in method that can in principle be used for point patterns on any underlying space \(\mathcal {X}\) where an appropriate “cost function” between objects is specified as p-th power of a metric d. All that is required is an algorithm that finds (maybe heuristically) a p-th-order Fréchet mean for individual points in \(\mathcal {X}\), i.e., finds \(z \in \mathcal {X}\) minimizing \(\sum _{j=1}^k d(x_j,z)^p\) for any given \(x_1,\ldots ,x_k \in \mathcal {X}\). We refer to this in what follows as the underlying location problem. The reduction to the underlying location problem allows us to treat the case of point patterns on a network equipped with the shortest-path metric and \(p=1\). Figure 2 gives an example for crime data in Valencia, Spain, which we study in more detail in Sect. 6.

Fig. 2 An example of a barycenter on a street network. Shown are 8 patterns of assault crimes during the summer months of 2010–2017 in the old town of Valencia (all in gray for better overall visibility). The resulting barycenter with respect to shortest-path distance along the streets is given in blue, with multipoints in purple (\(p=q=1\)). (Color figure online)

The barycenter problem we consider in this paper is closely related to the problem of computing an unbalanced Wasserstein barycenter, see, e.g., Chizat et al. (2018). However, rather than minimizing a Fréchet functional on the space of all measures, we minimize on the space \(\mathfrak {N}_{\mathrm {fin}}\) of \(\mathbb {Z}_{+}\)-valued measures, see Remark 5.

The plan of the paper is as follows. In Sect. 2, we introduce the TT- and RTT-metrics and discuss their relations to spike time, OSPA, and incomplete Wasserstein metrics. Section 3 specifies what we mean by a barycenter (or Fréchet mean) with respect to these metrics and gives an important result that forms the basis for our heuristic algorithm. Two versions of this algorithm, a more direct one and an improved one, which saves computation steps that are unlikely to substantially influence the final result, are discussed in detail in Sect. 4, along with some practical aspects. Section 5 contains a larger simulation study, which investigates robustness and runtime performance of the two algorithms for the case of Euclidean distance and \(p=2\). Finally, in Sect. 6 we give two applications to real data of crime events on city maps. The first one concerns street thefts in Bogotá, Colombia. We treat this again as data in Euclidean space, using \(p=2\). The second one deals with assault cases in the streets of Valencia, Spain. Here, we compute barycenters based on the actual shortest-path distance on the street network and use \(p=1\).

2 The transport–transform metric

Denote by \(\mathfrak {N}_{\mathrm {fin}}\) the space of finite point patterns (counting measures) on a complete separable metric space \((\mathcal {X},d)\), equipped with the usual \(\sigma \)-algebra \(\mathcal {N}_{\mathrm {fin}}\) generated by the point count maps \(\varPsi _A:\mathfrak {N}_{\mathrm {fin}}\rightarrow \mathbb {R}\), \(\xi \mapsto \xi (A)\) for \(A \subset \mathcal {X}\) Borel measurable. Elements of \(\mathfrak {N}_{\mathrm {fin}}\) are typically denoted by \(\xi ,\eta ,\zeta \) here. As usual, we write \(\delta _x\) for the Dirac measure with unit mass at \(x \in \mathcal {X}\). In the present section, we mostly use measure notation such as \(\xi = \sum _{i=1}^n \delta _{x_i}\), \(\xi (\{x\}) \ge 1\) or \(\xi + \eta \), but in later sections we also use corresponding (multi)set notation such as \(\xi = \{x_1,\ldots ,x_n\}\), \(x \in \xi \) or \(\xi \cup \eta \) where this is unambiguous.

We use \(|\xi | = \xi (\mathcal {X})\) to denote the total number of points in the pattern \(\xi \). For \(n \in \mathbb {Z}_{+}= \{0,1,2,\ldots \}\) write \([n] = \{1,2,\ldots ,n\}\) (including \([0] = \emptyset \)) and denote by \(\mathfrak {N}_n\) the set of point patterns with exactly n points. We first introduce the metrics we use on \(\mathfrak {N}_{\mathrm {fin}}\), which unify and generalize two of the main metrics used previously in the literature.

Definition 1

Let \(C > 0\) and \(p \ge 1\) be two parameters, referred to as penalty and order, respectively.

  (a)

    For \(\xi = \sum _{i=1}^m \delta _{x_i}, \eta = \sum _{j=1}^n \delta _{y_j} \in \mathfrak {N}_{\mathrm {fin}}\), define the transport–transform (TT) metric by

    $$\begin{aligned} \tau&(\xi ,\eta ) = \tau _{C,p}(\xi ,\eta ) \nonumber \\&=\biggl ( \min \biggl ( (m+n-2l) C^p +\sum _{r=1}^{l} d(x_{i_r},y_{j_r})^p \biggr ) \biggr )^{1/p}, \end{aligned}$$
    (2)

    where the minimum is taken over equal numbers of pairwise different indices \(i_1,\ldots ,i_l\) in [m] and \(j_1,\ldots ,j_l\) in [n], i.e., over the set

    $$\begin{aligned} \begin{aligned} S(m,n) = \bigl \{&(i_1,\ldots ,i_l;j_1,\ldots ,j_l)\,;\; \\&l \in \{0,1,\ldots ,\min \{m,n\}\},\\&i_1, \ldots , i_l \in [m] \text { pairwise different},\, \\&j_1, \ldots , j_l \in [n] \text { pairwise different} \bigr \}. \end{aligned} \end{aligned}$$
  (b)

    For \(\xi , \eta \in \mathfrak {N}_{\mathrm {fin}}\), define the relative transport–transform (RTT) metric by

    $$\begin{aligned} {\bar{\tau }}(\xi ,\eta ) = {\bar{\tau }}_{C,p}(\xi ,\eta ) = \frac{1}{\max \{|\xi |,|\eta |\}^{1/p}} \tau _{C,p}(\xi ,\eta ).\nonumber \\ \end{aligned}$$
    (3)

We state and prove below that \(\tau \) and \({\bar{\tau }}\) are indeed metrics.

The following result simplifies proofs of statements about these metrics and is furthermore invaluable for their computation. The idea is to extend the metric space \((\mathcal {X},d \wedge (2^{1/p} C))\), where \([d \wedge (2^{1/p} C)](x,y) = \min \{ d(x,y), 2^{1/p} C \}\), by setting \(\mathcal {X}' = \mathcal {X}\cup \{\aleph \}\) for an auxiliary element \(\aleph \not \in \mathcal {X}\) and

$$\begin{aligned} d'(x,y) = {\left\{ \begin{array}{ll} \min \{ d(x,y), 2^{1/p} C \} &{}\text {if } x,y \in \mathcal {X}; \\ C &{}\text {if } \aleph \in \{x,y\}, \ x \ne y; \\ 0 &{}\text {if } x=y=\aleph . \end{array}\right. } \end{aligned}$$

It is shown in Lemma A.1 that \((\mathcal {X}',d')\) is a metric space again. We may then compute distances in the \(\tau \) and \({\bar{\tau }}\) metrics by solving an optimal matching problem between point patterns with the same cardinality. For \(n \in \mathbb {N}\), denote by \(S_n\) the set of permutations of [n].

Theorem 1

Let \(\xi = \sum _{i=1}^m \delta _{x_i}, \eta = \sum _{j=1}^n \delta _{y_j} \in \mathfrak {N}_{\mathrm {fin}}\), where w.l.o.g. \(m \le n\) (otherwise swap \(\xi \) and \(\eta \)). Set \(x_i = \aleph \) for \(m+1 \le i \le n\) and \({\tilde{\xi }}= \sum _{i=1}^n \delta _{x_i}\). Then,

$$\begin{aligned} \begin{aligned} \tau (\xi ,\eta )&= \biggl (\min _{\pi \in S_n} \sum _{i=1}^n d'(x_i,y_{\pi (i)})^{p} \biggr )^{1/p} \quad \text {and} \\ \quad {\bar{\tau }}(\xi ,\eta )&= \biggl (\frac{1}{n} \min _{\pi \in S_n} \sum _{i=1}^n d'(x_i,y_{\pi (i)})^{p} \biggr )^{1/p}. \end{aligned} \end{aligned}$$

The proof of this and the other theorems in this section can be found in the appendix.

Remark 1

(Computation of TT- and RTT-metrics) Writing n for the maximum cardinality as in Theorem 1, this result shows that we can compute both \(\tau (\xi ,\eta )\) and \({\bar{\tau }}(\xi ,\eta )\) with a worst-case time complexity of \(O(n^3)\) by using the classic Hungarian method for the assignment problem; see Kuhn (1955). In practice, we use the auction algorithm proposed in Bertsekas (1988), because in our experience it usually has a much better runtime, although the default version has a somewhat worse worst-case complexity of \(O(n^3 \log (n))\).
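To make the reduction of Theorem 1 concrete, the following sketch computes \(\tau \) and \({\bar{\tau }}\) for two planar point patterns by padding the smaller pattern with points at \(\aleph \), truncating distances at \(2^{1/p} C\) and solving the resulting assignment problem. It is an illustration only: the helper name tt_dist is ours, and the Hungarian-type solver solve_LSAP from the R package clue is used in place of the auction algorithm implemented in ttbary.

```r
# Sketch: TT- and RTT-distance between two planar point patterns via Theorem 1.
# Assumes the 'clue' package; not the auction-based implementation of ttbary.
library(clue)

tt_dist <- function(x, y, C, p = 2) {
  # x, y: matrices with one point per row (columns = coordinates)
  m <- nrow(x); n <- nrow(y)
  if (m > n) return(tt_dist(y, x, C, p))          # w.l.o.g. m <= n
  # cost matrix on the extended space: rows 1..m are the points of xi,
  # rows (m+1)..n represent points at aleph (cost C^p against any real point)
  cost <- matrix(C^p, nrow = n, ncol = n)
  if (m >= 1) {
    d2 <- pmax(outer(rowSums(x^2), rowSums(y^2), "+") - 2 * x %*% t(y), 0)
    d  <- pmin(sqrt(d2), 2^(1/p) * C)             # truncate at 2^(1/p) * C
    cost[1:m, ] <- d^p
  }
  pi_opt <- as.integer(clue::solve_LSAP(cost))    # optimal assignment
  total  <- sum(cost[cbind(seq_len(n), pi_opt)])
  c(tt = total^(1/p), rtt = (total / n)^(1/p))
}

# small example: patterns with 5 and 8 points in the unit square
set.seed(1)
xi  <- matrix(runif(10), ncol = 2)
eta <- matrix(runif(16), ncol = 2)
tt_dist(xi, eta, C = 0.1, p = 2)
```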

Theorem 2

The maps \(\tau \) and \({\bar{\tau }}\) are metrics on \(\mathfrak {N}_{\mathrm {fin}}\).

The next result establishes our previous claim that the new transport–transform construction generalizes two metrics on \(\mathfrak {N}_{\mathrm {fin}}\) previously used in the literature.

Theorem 3

  (a)

    If \(p=1\), then for any \(\xi ,\eta \in \mathfrak {N}_{\mathrm {fin}}\)

    $$\begin{aligned} \tau (\xi ,\eta ) = \min _{(\xi _0, \ldots , \xi _N)} \sum _{i=0}^{N-1} c_{\text {elem}}(\xi _i,\xi _{i+1}), \end{aligned}$$
    (4)

    where the minimum is taken over all \(N \in \mathbb {N}\) and all paths \((\xi _0, \ldots , \xi _N) \in \mathfrak {N}_{\mathrm {fin}}^{N+1}\) such that \(\xi _0 = \xi \), \(\xi _N = \eta \), and from \(\xi _i\) to \(\xi _{i+1}\) either a single point is added or deleted at cost \(c_{\text {elem}}(\xi _i,\xi _{i+1}) = C\) or a single point is moved from x to y at cost \(c_{\text {elem}}(\xi _i,\xi _{i+1}) = d(x,y)\).

  (b)

    If \({{\,\mathrm{diam}\,}}(\mathcal {X}) = \sup _{x,y \in \mathcal {X}} d(x,y) \le 2^{1/p} C\), then for any \(\xi = \sum _{i=1}^m \delta _{x_i}, \eta = \sum _{j=1}^n \delta _{y_j} \in \mathfrak {N}_{\mathrm {fin}}\), assuming w.l.o.g. \(m \le n\)

    $$\begin{aligned} {\bar{\tau }}(\xi ,\eta )^p = \frac{1}{n} \biggl ((n-m) C^p + \min _{\pi \in S_n} \sum _{i=1}^m d(x_i,y_{\pi (i)})^p \biggr ).\nonumber \\ \end{aligned}$$
    (5)

Theorem 3(a) implies that the TT-metric is the same as the spike time metric (using add and delete penalties \(P_a=P_d=C\) and a move penalty \(P_m=1\)), which was originally introduced on \(\mathbb {R}_{+}\) by Victor and Purpura (1997) and generalized to metric spaces by Diez et al. (2012). It can be seen from the proof in the appendix that the right hand side of (4) is not a metric in general if \(p > 1\).

Theorem 3(b) implies that the RTT-metric is the same as the OSPA metric, introduced in Schuhmacher and Xia (2008) and Schuhmacher et al. (2008). Note that in the definition of the OSPA metric \({{\,\mathrm{diam}\,}}(\mathcal {X}) \le C \le 2^{1/p} C\) was either required or enforced by taking the minimum of d with C. Here, it can be seen that the right hand side of (5) is not a metric in general if \({{\,\mathrm{diam}\,}}(\mathcal {X}) > 2 C\).

Remark 2

(Computation of spike time distances) The spike time distances in Victor and Purpura (1997) and Diez et al. (2012) allowed for separate add and delete penalties \(P_a\) and \(P_d\), as well as a move penalty \(P_m\) (a factor in front of d(x, y)). We set here \(P_a=P_d=C\) to obtain a proper metric and divide distances by \(P_m\), which is just a scaling. Thus, the parameter \(C = P_a/P_m = P_d/P_m\) is all that remains.

As noted at the end of Section 4 in Diez et al. (2012), having different add and delete penalties may be useful for controlling the total number of points in a barycenter point pattern. Let us point out therefore that Theorem 1 is easily adapted to this more general situation by setting \(d'(x,y) = \min \{d(x,y), (P_a^p+P_d^p)^{1/p}\}\), \(d'(\aleph ,y) = P_a\) and \(d'(x,\aleph ) = P_d\) for all \(x,y \in \mathcal {X}\).

In particular, this yields a worst-time complexity of \(O(n^3)\) for general (maybe asymmetric) spike time distances in general metric spaces, which is a substantial improvement over the \(O(n^6)\) complexity of the incremental matching algorithm presented in Diez et al. (2012).

Remark 3

(Unbalanced Wasserstein metrics) The TT- and RTT-metrics can be seen as unbalanced Wasserstein metrics, see, e.g., Chizat et al. (2018), Liero et al. (2018) and the references therein. Minimizing over the space \(\mathfrak {M}_{\mathrm {fin}}\) of all finite measures on \(\mathcal {X}\times \mathcal {X}\), we obtain the TT-distance as a solution to a particular instance of the unbalanced optimal transport problem in Chizat et al. (2018), Definition 2.11, namely

$$\begin{aligned} \begin{aligned} \tau (\xi ,\eta )^p&= \inf _{\gamma \in \mathfrak {M}_{\mathrm {fin}}} \biggl (\int _{\mathcal {X}\times \mathcal {X}} d(x,y)^p \; \gamma (dx, dy) \\&\quad +C^p \Vert \xi -\gamma _1 \Vert _{\mathrm {TV}} + C^p \Vert \eta -\gamma _2 \Vert _{\mathrm {TV}} \biggr ), \end{aligned} \end{aligned}$$
(6)

where \(\gamma _1 = \gamma (\cdot \times \mathcal {X})\) and \(\gamma _2 = \gamma (\mathcal {X}\times \cdot )\) denote the marginals of \(\gamma \), and \(\Vert \cdot \Vert _{\mathrm {TV}}\) is the total variation norm of signed measures; specifically \(\Vert \mu -\nu \Vert _{\mathrm {TV}} = \sup _{A} (\mu (A)-\nu (A)) + \sup _{A} (\nu (A)-\mu (A))\) for \(\mu , \nu \in \mathfrak {M}_{\mathrm {fin}}\), where the suprema are taken over all measurable subsets of \(\mathcal {X}\).

Equation (6) can be shown as follows. It is straightforward to see that we may take the infimum on the right hand side only over \(\gamma \in \mathfrak {M}_{\mathrm {fin}}\) with marginals \(\gamma _1 \le \xi \) and \(\gamma _2 \le \eta \), because any additional mass in \(\gamma \) may be removed without increasing the total cost of \(\gamma \). Writing \(\xi = \sum _{i=1}^n \delta _{x_i}\) and \(\eta = \sum _{i=1}^n \delta _{y_i}\) with the help of additional points at \(\aleph \) (if necessary), we obtain by similar arguments as in the proof of Theorem 1 that the latter problem is equivalent to the discrete transportation problem

$$\begin{aligned} \begin{aligned} \min _{(\gamma _{ij})_{1 \le i,j \le n}}&\sum _{i,j=1}^{n} d'(x_i,y_j)^p \cdot \gamma _{ij} \quad \\ \text {s.t. }&\sum _{j = 1}^n \gamma _{ij} = 1 \ \text {for all } i, \ \sum _{i = 1}^n \gamma _{ij} = 1 \ \text {for all } j, \quad \\&\gamma _{ij} \ge 0 \ \text {for all } i,j. \end{aligned} \end{aligned}$$

It is a standard result in linear programming that this problem always has a solution with \(\gamma _{ij} \in \{0,1\}\), \(1 \le i,j \le n\); see, e.g., the theorem in Section 6.5 of Luenberger and Ye (2008), which is essentially due to the fact that the structure of the constraints allows for a back substitution approach involving only additions and subtractions. We may therefore conclude from Theorem 1 that Equation (6) holds and that the infimum on the right hand side is attained, namely by the measure \(\gamma \) that places unit mass on each pair of an optimal matching in Theorem 1, restricted to \(\mathcal {X}\times \mathcal {X}\).

In principle, Remark 3 allows us to specialize results and algorithms for unbalanced Wasserstein metrics to TT- and RTT-metrics. However, the discrete setting we consider here is sometimes not included in the general theorems or requires a more specialized treatment. Algorithms for computing unbalanced transport plans are typically derived from balanced optimal transport algorithms; a selection can be found in Chizat (2017). The auction algorithm we use in this paper is derived from the auction algorithm used for balanced assignment problems in a similar way.

3 Barycenters with respect to the TT-metric

For data on quite general metric spaces, barycenters can formalize the idea of a center element representing the data. In the case of \(\mathfrak {N}_{\mathrm {fin}}\), we are thus looking for a center point pattern that gives a good first-order representation of a set of data point patterns \(\xi _1,\ldots ,\xi _k\). More formally, we may define a barycenter as the (weighted) q-th-order Fréchet mean with respect to \(\tau \); see Fréchet (1948).

Definition 2

For \(k \in \mathbb {N}\), let \(\xi _1,\ldots ,\xi _k \in \mathfrak {N}_{\mathrm {fin}}\) be data point patterns and \(\lambda _1,\ldots ,\lambda _k > 0\) with \(\sum _{j=1}^k \lambda _j = 1\) be weights. Let furthermore \(q \ge 1\). Then, we call any

$$\begin{aligned} \zeta _* \in \mathop {{{\,\mathrm{\text {arg min}}\,}}}\limits _{\zeta \in \mathfrak {N}_{\mathrm {fin}}} \sum _{j=1}^k \lambda _j \tau (\xi _j,\zeta )^q \end{aligned}$$
(7)

a (weighted) barycenter of order q. If no weights are specified, we tacitly assume that \(\lambda _j = 1/k\) for \(1 \le j \le k\), leading to an “unweighted” barycenter.

Remark 4

For \(q = 2\), barycenters on general metric spaces are simply known as (empirical) Fréchet means. For \(q=1\), they are sometimes known as Fréchet medians. This comes from the fact that given \(x_1,\ldots ,x_k \in \mathbb {R}^D\), we have

$$\begin{aligned} \mathop {{{\,\mathrm{\text {arg min}}\,}}}\limits _{z \in \mathbb {R}^D} \sum _{j=1}^k \Vert x_j-z \Vert ^2 = \frac{1}{k} \sum _{j=1}^k x_j \end{aligned}$$
(8)

(the \({{\,\mathrm{\text {arg min}}\,}}\) is unique here), and that given \(x_1, \ldots ,x_k \in \mathbb {R}\), we have

$$\begin{aligned} \mathop {{{\,\mathrm{\text {arg min}}\,}}}\limits _{z \in \mathbb {R}} \sum _{j=1}^k \Vert x_j-z \Vert = {{\,\mathrm{median}\,}}\{x_1,\ldots ,x_k\}, \end{aligned}$$
(9)

where the right hand side denotes the set of medians \(\bigl \{z \in \mathbb {R};\, \#\{j;\, x_j \le z\} = \#\{j;\, x_j \ge z\} \bigr \}\).

Remark 5

As seen in Remark 3, we may interpret \(\tau \) as an unbalanced Wasserstein metric. There has been a great deal of research on Wasserstein barycenters (in the Fréchet mean sense as above, see, e.g., Agueh and Carlier (2011) or Cuturi and Doucet (2014)), which more recently also extends to unbalanced Wasserstein metrics, see, e.g., Chizat et al. (2018) or Schmitz et al. (2018). In addition to the fact that much of the corresponding theory is not well adapted to the case of discrete input measures, with the notable exception of Anderes et al. (2016), we point out that a fundamental difference of (7) lies in the fact that we minimize over the space \(\mathfrak {N}_{\mathrm {fin}}\) of \(\mathbb {Z}_{+}\)-valued measures. This space is smaller than the space \(\mathfrak {M}_{\mathrm {fin}}\) of general finite measures, but has a more complicated structure because it decays into connected components \(\mathfrak {N}_n = \{\xi \in \mathfrak {N}_{\mathrm {fin}};\, |\xi |=n \}\) (under the TT-metric), implying, e.g., that continuous optimization procedures will not work directly.

In what follows, we always set \(p = q\) and mostly choose this number from \(\{1,2\}\). We refer to the resulting barycenters simply as 1- and 2-barycenters or as point pattern median and point pattern mean, respectively. Point pattern medians have been introduced under the name of prototypes in Schoenberg and Tranbarger (2008) on \(\mathbb {R}\) and studied in higher dimensions in Diez et al. (2012) and Mateu et al. (2015). However, in these papers the applicability was limited to rather small data sets due to the large computation cost of \(O(n^6)\) mentioned in Remark 2.

Using the construction from Theorem 1, we may reformulate the barycenter problem as a multidimensional assignment problem, generalizing Lemma 16 in Koliander et al. (2018). Note that for the TT-metric we can add an arbitrary number of points at \(\aleph \) to both point patterns without changing the minimum in Theorem 1.

Theorem 4

For point patterns \(\xi _j = \sum _{i=1}^{n_j} \delta _{x_{ij}}\), \(j \in [k]\), let \(\tilde{n}:= \bigl \lfloor \frac{2}{k+1} \sum _{j=1}^{k} n_j \bigr \rfloor \) and \(n \ge \max \{\tilde{n}, n_j;\, 1 \le j \le k\}\). Set \(x_{ij} = \aleph \) for \(n_j+1 \le i \le n\) and \({\tilde{\xi }}_j = \sum _{i=1}^{n} \delta _{x_{ij}}\) for any \(j \in [k]\).

Then, for any \(\pi _{*,1},\ldots ,\pi _{*,k} \in S_{n}\) jointly minimizing

$$\begin{aligned} \sum _{i=1}^{n} \min _{z \in \mathcal {X}'} \sum _{j=1}^k d'(x_{\pi _j(i),j},z)^p \end{aligned}$$
(10)

the point pattern \(\zeta _*\vert _{\mathcal {X}}\) with \(\zeta _* = \sum _{i=1}^{n} \delta _{z_{i}}\), where \(z_i \in {{\,\mathrm{\text {arg min}}\,}}_{z \in \mathcal {X}'} \sum _{j=1}^k d'(x_{\pi _{*,j}(i),j},z)^p\), is a p-th-order barycenter with respect to the TT-metric.

The \(\pi _{*,1},\ldots ,\pi _{*,k} \in S_{n}\) above define n disjoint “clusters” \(\mathcal {C}_i = \{x_{\pi _{*,j}(i),j};\, 1 \le j \le k\}\), each of which contains exactly one (possibly virtual) point of each point pattern. The minimization of (10) may thus be interpreted as a multidimensional assignment problem with cluster cost

$$\begin{aligned} \mathrm {cost}_*(\mathcal {C}) = \min _{z \in \mathcal {X}'} \sum _{x \in \mathcal {C}} d'(x,z)^p. \end{aligned}$$
(11)
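As an illustration of the cluster cost (11) in the Euclidean case, the following sketch evaluates \(\sum _{x \in \mathcal {C}} d'(x,z)^p\) for a candidate center \(z \in \mathcal {X}'\); points of the cluster at \(\aleph \) are encoded as NA rows and a center at \(\aleph \) as NULL. The function name cluster_cost is ours and the snippet is not part of the ttbary implementation.

```r
# Sketch: evaluate the cluster cost sum_{x in C} d'(x, z)^p for a candidate
# center z in X' (Euclidean case). Illustration only.
cluster_cost <- function(cluster, z, C, p = 2) {
  # cluster: k x D matrix, one (possibly virtual) point per data pattern
  at_aleph <- apply(cluster, 1, function(r) any(is.na(r)))
  if (is.null(z)) {                      # center at aleph:
    return(sum(!at_aleph) * C^p)         # real points pay C^p, aleph points pay 0
  }
  d <- sqrt(rowSums(sweep(cluster[!at_aleph, , drop = FALSE], 2, z)^2))
  sum(pmin(d, 2^(1/p) * C)^p) + sum(at_aleph) * C^p
}

# example: a cluster from k = 4 patterns, one point currently at aleph
cl <- rbind(c(0.21, 0.30), c(0.25, 0.28), c(0.90, 0.95), c(NA, NA))
cluster_cost(cl, z = colMeans(cl[1:2, ]), C = 0.1, p = 2)  # center in X
cluster_cost(cl, z = NULL, C = 0.1, p = 2)                 # center at aleph
```

Comparing the two calls shows whether keeping the center in \(\mathcal {X}\) or moving it to \(\aleph \) is cheaper for this particular cluster.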

Proof

Let us first give an upper bound on the cardinality of the barycenter. A single barycenter point can be matched with up to k points (one from each point pattern). If such a point is matched with only \(\frac{k}{2}\) points or fewer, it cannot be worse to delete it: its contribution to the objective function is at least \(\frac{k}{2} C^p\), because at least \(\frac{k}{2}\) of the patterns then contribute a point at \(\aleph \) to its cluster at cost \(C^p\) each, while deleting it adds at most \(\frac{k}{2} C^p\) to the objective function, since each of the at most \(\frac{k}{2}\) matched data points then pays \(C^p\).

So, every barycenter point should be matched with at least \(\lceil \frac{k+1}{2} \rceil \) points. The total number of points is \(\sum _{j=1}^{k} n_j\). Therefore, the number of barycenter points is bounded above by \(\tilde{n} = \bigl \lfloor \frac{2}{k+1} \sum _{j=1}^{k} n_j \bigr \rfloor \).

It is thus sufficient to fill up all the point patterns \(\xi _j\) to n points and work also with an ansatz of n points for \(\zeta \). Theorem 1 yields

$$\begin{aligned} \begin{aligned} \min _{\zeta \in \mathfrak {N}_{\mathrm {fin}}}&\sum _{j=1}^k \tau (\xi _j,\zeta )^p \\&= \min _{z_1,\ldots ,z_{n} \in \mathcal {X}'} \sum _{j=1}^k \min _{\pi \in S_{n}} \sum _{i=1}^{n} d'(x_{\pi (i),j},z_i)^p \\&= \min _{z_1,\ldots ,z_n \in \mathcal {X}'} \min _{\pi _1,\ldots ,\pi _k \in S_{n}} \sum _{j=1}^k \sum _{i=1}^{n} d'(x_{\pi _j(i),j},z_i)^p \\&= \min _{\pi _1,\ldots ,\pi _k \in S_{n}} \sum _{i=1}^{n} \min _{z_i \in \mathcal {X}'} \sum _{j=1}^k d'(x_{\pi _j(i),j},z_i)^p \end{aligned} \end{aligned}$$
(12)

and that any minimizer \(\zeta _*\vert _{\mathcal {X}}=\sum _{i=1}^{n} \delta _{z_i}\vert _{\mathcal {X}}\) on the left hand side is obtained from jointly minimizing in \(\pi _1,\ldots ,\pi _k\) and \(z_1,\ldots ,z_{n}\) on the right hand side. \(\square \)

4 Alternating clustering algorithms

Based on Theorem 4, we propose an algorithm that alternates between minimizing

$$\begin{aligned} \sum _{j=1}^k \sum _{i=1}^{n} d'(x_{\pi _j(i),j},z_i)^p \end{aligned}$$
(13)

in \(\pi _1,\ldots ,\pi _k \in S_{n}\) and in \(z_1,\ldots ,z_{n} \in \mathcal {X}'\) until convergence. Such an algorithm terminates in a local minimum of (13) after a finite number of steps, because (13) can never increase and the minimization in the permutations is over a finite space.

Since this underlying idea is close to the popular k-means clustering algorithm, we named the main function in the pseudocode and in the actual implementation kMeansBary (note, however, that n plays the role of k in our notation). Similar alternating algorithms in the context of Wasserstein-2 barycenters for finitely supported probability measures have been proposed in Cuturi and Doucet (2014), Borgwardt (2019) and del Barrio et al. (2019). See Sect. 5, where we compare the results of kMeansBary with those of Algorithm 2 in Cuturi and Doucet (2014).

In what follows, we present pseudocode along with the underlying ideas and explanations for two versions of the kMeansBary-algorithm that we dub original and improved. Here, “improved” refers to the fact that we cut down on certain computation steps in order to save runtime. We will see in Sect. 5 that this comes essentially without any performance loss.

User-friendly implementations of both algorithms are publicly available in the R-package ttbary; see Müller and Schuhmacher (2019).

4.1 Our original kMeansBary algorithm

The pseudocode for the basic alternating strategy described above is given in Algorithm 1. We have introduced a stopping parameter \(\delta \) to allow termination before the local optimum is reached. Since we are not interested in the actual clustering, but only in the position of the centers \(z_1,\ldots ,z_{n}\), it seems very unlikely (though possible) that the solution changes substantially once the cost decrease has become very small. What is more, such a change might be spurious due to rounding errors in the data or when we use an approximation method for optimizing in the centers. Note also that we can always set \(\delta \) to the smallest representable positive floating-point number to ensure convergence to the local optimum.

Algorithm 1: pseudocode for the original kMeansBary algorithm
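To convey the structure of Algorithm 1 without reproducing the full pseudocode, the following is a stripped-down sketch of the alternating scheme for Euclidean distance and \(p=2\), restricted to data patterns and a barycenter candidate of equal, fixed cardinality, so that the \(\aleph \)-related steps (optimDelete, optimAdd) and the truncation at \(2^{1/p} C\) are omitted. It uses solve_LSAP from the clue package in place of the auction algorithm; the full algorithm is available in ttbary.

```r
# Stripped-down sketch of the alternating scheme behind Algorithm 1:
# Euclidean distance, p = 2, all patterns (and the center) with n points,
# no aleph handling and no truncation. Assumes the 'clue' package.
library(clue)

kmeans_bary_sketch <- function(pplist, zeta, maxit = 50, delta = 1e-10) {
  n <- nrow(zeta); k <- length(pplist)
  cost_old <- Inf
  perm <- matrix(seq_len(n), nrow = n, ncol = k)     # current assignments
  for (it in seq_len(maxit)) {
    # "optimPerm" step: optimally match each data pattern to the current centers
    cost <- 0
    for (j in seq_len(k)) {
      d2 <- as.matrix(dist(rbind(zeta, pplist[[j]])))[1:n, n + 1:n]^2
      pi_j <- as.integer(clue::solve_LSAP(d2))
      perm[, j] <- pi_j
      cost <- cost + sum(d2[cbind(1:n, pi_j)])
    }
    if (cost_old - cost < delta) break                # stopping criterion
    cost_old <- cost
    # "optimBary" step: move each center to the mean of its current cluster
    for (i in seq_len(n)) {
      cl <- t(sapply(seq_len(k), function(j) pplist[[j]][perm[i, j], ]))
      zeta[i, ] <- colMeans(cl)
    }
  }
  list(barycenter = zeta, cost = cost)
}

# example: 5 patterns with 20 points each, random starting centers
set.seed(2)
pplist <- replicate(5, matrix(runif(40), ncol = 2), simplify = FALSE)
res <- kmeans_bary_sketch(pplist, zeta = matrix(runif(40), ncol = 2))
```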

The minimization with respect to \(\pi _1,\ldots ,\pi _k\) is performed by optimPerm. This function computes an optimal matching between the current barycenter candidate \(z_1,\ldots ,z_{n}\) and each data point pattern in pplist, using an alternating version of the auction algorithm with \(\varepsilon \)-scaling; see Remark 1 and Bertsekas (1988) for more details. We output the cost of the current matching and an \(n \times k\) matrix perm, whose j-th column specifies the order in which the points of the j-th data pattern are matched to \(z_1,\ldots ,z_{n}\). For greater efficiency, we save auxiliary information (price and profit vectors) and use it to initialize the auction algorithm when calling it again with the same data point pattern.

For practical purposes, we have split up the minimization with respect to \(z_1,\ldots ,z_{n} \in \mathcal {X}'\) into a function optimBary that optimizes the positions within \(\mathcal {X}\) and functions optimDelete and optimAdd that optimize which of the \(z_i\) to move from \(\mathcal {X}\) to \(\aleph \) and from \(\aleph \) to \(\mathcal {X}\), respectively. We discuss details of these functions under the separate headings below.

In addition to the outputs of the various functions shown in Algorithm 1, we also keep information on the quality of each match of points up to date. We call the match of a \(z_i\) with a data point \(x_{i'j}\)

$$\begin{aligned} \begin{aligned} {\textit{happy}}&\text { if } z_i, x_{i'j} \in \mathcal {X}\text { and } d'(z_i,x_{i'j}) < 2^{1/p} C\\ {\textit{miserable}}&\text { if } z_i, x_{i'j} \in \mathcal {X}\text { and } d'(z_i,x_{i'j}) = 2^{1/p} C\\&\text {or if } z_i = \aleph , x_{i'j} \in \mathcal {X}\\ \textit{to } \aleph&\text { if } x_{i'j} = \aleph . \end{aligned} \end{aligned}$$

Note that a miserable match is the worst possible in the sense that \(\mathrm {cost}(\mathcal {C}_i) = \sum _{x \in \mathcal {C}_i} d'(x,z_i)^p\) for center \(z_i\) cannot increase if \(x_{i'j}\) is replaced by any other \(x \in \mathcal {X}'\).

4.1.1 Details on optimBary

The purpose of this function is to find for each \(z_i \in \mathcal {X}\) (i.e., not currently at \(\aleph \)) a location in \(\mathcal {X}\) that minimizes \(\mathrm {cost}(\mathcal {C}_i)\) for its current cluster \(\mathcal {C}_i = \{x_{\pi _{j}(i),j};\, 1 \le j \le k\}\). This amounts to a more traditional location problem in \(\mathcal {X}\), except that it is typically made (much) more difficult by the fact that we have to truncate distances at \(2^{1/p} C\).

Note that any cluster points at \(\aleph \) can be ignored because they always contribute the same amount to the cluster cost, no matter where the center lies. The same is true for individual points that have a much larger distance than \(2^{1/p} C\) from the bulk of the points. However, there are countless scenarios with (groups of) points lying around distance \(2^{1/p} C\) apart from one another, for which optimizing the cluster cost becomes a difficult problem (the objective is piecewise smooth on a space that is fragmented in complicated ways).

As a simple heuristic that works well in cases where we do not have to cut too many distances (i.e., C is not too small), we suggest to ignore all points that are at the maximal \(d'\)-distance \(2^{1/p} C\) from the current \(z_i\) when computing the new \(z_i\). Note that in this way the cluster cost can never increase.

Algorithm 2: pseudocode for optimBary

Algorithm 2 gives corresponding pseudocode. The function optimClusterCenter handles the location problem for the untruncated metric d on \(\mathcal {X}\). If, for example, \(\mathcal {X}= \mathbb {R}^D\) is equipped with the Euclidean metric and \(p=2\), Equation (8) implies that optimClusterCenter simply has to take the (coordinatewise) average of all happy points. The case \(p=1\) can be tackled, at higher computational effort, by approximation via the popular Weiszfeld algorithm; see Weiszfeld (1937).
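A minimal sketch of such an optimClusterCenter for \(\mathcal {X}= \mathbb {R}^D\), assuming the happy points of the cluster have already been collected in a matrix: the coordinatewise mean for \(p=2\), and a few Weiszfeld iterations for \(p=1\). The function name and parameters are ours, chosen for illustration.

```r
# Sketch of optimClusterCenter for X = R^D: mean of the happy points for p = 2,
# Weiszfeld iterations for p = 1. 'happy' holds the happy points (one per row).
optim_cluster_center <- function(happy, p = 2, iter = 25, tol = 1e-8) {
  if (nrow(happy) == 0) return(NULL)        # nothing to recenter on
  z <- colMeans(happy)                      # optimal for p = 2, starting value for p = 1
  if (p == 1) {
    for (s in seq_len(iter)) {              # Weiszfeld: weights 1/distance
      w <- 1 / pmax(sqrt(rowSums(sweep(happy, 2, z)^2)), tol)
      z_new <- colSums(happy * w) / sum(w)
      if (sum(abs(z_new - z)) < tol) { z <- z_new; break }
      z <- z_new
    }
  }
  z
}

# example
pts <- rbind(c(0.1, 0.2), c(0.15, 0.25), c(0.9, 0.1))
optim_cluster_center(pts, p = 2)   # coordinatewise mean
optim_cluster_center(pts, p = 1)   # approximate geometric median
```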

As a further instance, which we will take up in Sect. 6, we consider the situation where \(\mathcal {X}\) is a simple graph (VE) equipped with the shortest-path distance and \(p=1\). It can be shown that in this case the location problem in \(\mathcal {X}\) is solved by an element \(z_i\) of \(V \cup \mathcal {C}_i\), i.e., either a vertex of the graph or any data point, see Hakimi (1964). We therefore proceed by first computing the distance matrix between all these points, which is then used for the entire algorithm. Such shortest-path distance computations in sparse graphs with thousands of points can be performed in (at most) a few seconds by various algorithms, see Chapter 25 in Cormen et al. (2009) and the concrete timing in Sect. 6.2. It is now easy to implement the function optimClusterCenter. For a given set of happy points of a cluster \(\mathcal {C}_i\), pick the corresponding columns in the distance matrix, add them up and determine the minimal entry of the resulting vector. If there are several such entries, which due to choosing \(p=1\) can happen quite frequently, we pick one among them uniformly at random. The index of the obtained entry identifies the center point \(z_i\).
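The network version then reduces to column sums of the precomputed distance matrix; a sketch, assuming dmat contains the shortest-path distances from every candidate center location in \(V \cup \mathcal {C}_i\) (rows) to the data points (columns):

```r
# Sketch of the graph version of optimClusterCenter (p = 1): dmat is a
# precomputed shortest-path distance matrix, rows = candidate center locations
# (vertices and data points), columns = data points; happy_cols indexes the
# happy points of the current cluster. Illustration only.
net_cluster_center <- function(dmat, happy_cols) {
  v <- rowSums(dmat[, happy_cols, drop = FALSE])   # cost of each candidate center
  best <- which(v == min(v))                       # ties occur frequently for p = 1
  if (length(best) > 1) best <- sample(best, 1)    # break ties uniformly at random
  best                                             # row index of the new center
}

# toy example: 4 candidate locations, 3 data points
dmat <- matrix(c(0.0, 0.4, 0.7,
                 0.3, 0.1, 0.5,
                 0.6, 0.3, 0.2,
                 0.9, 0.8, 0.1), nrow = 4, byrow = TRUE)
net_cluster_center(dmat, happy_cols = c(1, 2))
```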

Precomputing the distance matrix between all points of \(V \cup \mathcal {C}_i\) in the graph case has the additional advantage that no distances have to be computed in the optimPerm step. It is, on the other hand, the main bottleneck of the procedure and may not be feasible in situations with very large graphs and data sets. In this case, we can resort to one of the various heuristics available, such as the single and multi-hub heuristics proposed (in principle) in Bandelt et al. (1994) and Koliander et al. (2018).

4.1.2 Details on optimDelete

This function deletes (i.e., moves to \(\aleph \)) any \(z_i \in \mathcal {X}\) for which this operation decreases \(\mathrm {cost}(\mathcal {C}_i)\).

We denote by \(k_{\mathrm {happy}}\), \(k_{\mathrm {miser}}\) and \(k_{\aleph }\) the numbers of data points in \(\mathcal {C}_i\) that are happy, miserable and at \(\aleph \), respectively. Write furthermore \(c_{\mathrm {happy}}\) for the total cost of matching the happy points to \(z_i\). If \(z_i\) stays in \(\mathcal {X}\), the cluster incurs an overall total cost of

$$\begin{aligned} c_{\mathrm {happy}} + k_{\mathrm {miser}}\cdot 2C^p + k_{\aleph }\cdot C^p \end{aligned}$$

as opposed to

$$\begin{aligned} k_{\mathrm {happy}}\cdot C^p + k_{\mathrm {miser}}\cdot C^p \end{aligned}$$

if we delete \(z_i\). Subtracting \(k_{\mathrm {miser}}\cdot C^p\) from both expressions, this leads to the deletion condition

$$\begin{aligned} k_{\mathrm {happy}}C^p < c_{\mathrm {happy}} + (k-k_{\mathrm {happy}}) C^p. \end{aligned}$$

Since \(c_{\mathrm {happy}} \ge 0\), a sufficient condition for deletion is \(2 k_{\mathrm {happy}}< k\). We use this as a quick pretest, which allows us to avoid computing \(c_{\mathrm {happy}}\) sometimes. The full deletion procedure is presented in Algorithm 3.

Algorithm 3: pseudocode for optimDelete
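The deletion test itself is cheap once the counts \(k_{\mathrm {happy}}\), \(k_{\mathrm {miser}}\) and \(k_{\aleph }\) are available; a sketch (with our own function names), where the cost of the happy matches is passed as a function so that it is only evaluated when the quick pretest fails:

```r
# Sketch of the deletion test in optimDelete: delete the center z_i (move it to
# aleph) if k_happy * C^p < c_happy + (k - k_happy) * C^p; the quick pretest
# 2 * k_happy < k avoids computing c_happy. Illustration only.
delete_center <- function(k_happy, k_miser, k_aleph, C, p, c_happy_fun) {
  k <- k_happy + k_miser + k_aleph
  if (2 * k_happy < k) return(TRUE)          # sufficient condition (c_happy >= 0)
  c_happy <- c_happy_fun()                   # total cost of the happy matches
  k_happy * C^p < c_happy + (k - k_happy) * C^p
}

# example: 7 data patterns; with happy matching cost 0.035 we delete,
# with happy matching cost 0.002 we keep the center
delete_center(4, 2, 1, C = 0.1, p = 2, c_happy_fun = function() 0.035)  # TRUE
delete_center(4, 2, 1, C = 0.1, p = 2, c_happy_fun = function() 0.002)  # FALSE
```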

4.1.3 Details on optimAdd

This function adds (i.e., moves to \(\mathcal {X}\)) any \(z_i\) currently at \(\aleph \) for which it finds a way to do so that decreases \(\mathrm {cost}(\mathcal {C}_i)\). Pseudocode is given in Algorithm 4.

As a compromise between computational simplicity and finding a good location in \(\mathcal {X}\), we first sample a proposal location \(\tilde{z}\) uniformly from all miserable data points (i.e., points from any cluster that are currently in a miserable match with their center). Before we consider moving \(z_i\) to \(\tilde{z}\), we rebuild the cluster \(\mathcal {C}_i\) in such a way that this move has a better chance of being accepted.

The corresponding procedure is performed by the optimizeCluster-function in the pseudocode: For each data pattern \(\xi _j\), pick the miserable point that is closest to \(\tilde{z}\) (if there is any) and exchange it with the corresponding point \(x_{\pi _j(i),j}\) that is currently in \(\mathcal {C}_i\). Since the point coming from the other cluster was miserable before, the cost of that cluster cannot increase by this exchange. The cost of the cluster \(\mathcal {C}_i\) can increase only if it loses a point located at \(\aleph \) in the exchange. In this case, the cost increases by \(C^p\), which is compensated by the fact that the cost of the other cluster must decrease, either from \(2 C^p\) to \(C^p\) if its center is in \(\mathcal {X}\), or from \(C^p\) to 0 if its center is at \(\aleph \). Thus, the total cost remains the same, but \(\mathcal {C}_i\) has an additional point in \(\mathcal {X}\) now, which makes the successful addition of \(z_i\) to \(\mathcal {X}\) more likely.

To further decrease the prospective cluster cost after addition, we update the proposal \(\tilde{z}\) by recentering it in its new cluster using the appropriate optimClusterCenter-function introduced in optimBary (applied to the set of points of the new cluster that are in a happy match with \(\tilde{z}\)).

Finally, we check whether the cost of the new cluster based on the updated \(\tilde{z}\) is smaller than the same cost based on \(z_i = \aleph \), which is \(C^p\) times the number \(k_{\mathcal {X}}\) of non-\(\aleph \) points in the new cluster. If this is the case, we set \(z_i\) to \(\tilde{z}\).

Algorithm 4: pseudocode for optimAdd
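The final acceptance test of optimAdd can be sketched in the same spirit for the Euclidean case; here cluster_real contains the non-\(\aleph \) points of the rebuilt cluster and k_aleph the number of its points at \(\aleph \). Names and arguments are ours, for illustration only.

```r
# Sketch of the acceptance test in optimAdd (Euclidean case): accept the
# proposal z_tilde if the cost of the rebuilt cluster with center z_tilde is
# smaller than k_X * C^p, its cost with the center left at aleph.
accept_addition <- function(cluster_real, k_aleph, z_tilde, C, p = 2) {
  d <- sqrt(rowSums(sweep(cluster_real, 2, z_tilde)^2))
  cost_new <- sum(pmin(d, 2^(1/p) * C)^p) + k_aleph * C^p   # cost with center z_tilde
  cost_new < nrow(cluster_real) * C^p                       # cost with center at aleph
}

# example: three nearby points, proposal at their mean, one aleph point in the cluster
cl <- rbind(c(0.41, 0.52), c(0.44, 0.50), c(0.40, 0.55))
accept_addition(cl, k_aleph = 1, z_tilde = colMeans(cl), C = 0.1, p = 2)  # TRUE
```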

4.2 An improved kMeansBary algorithm

For obtaining an algorithm with a reduced computational cost, we cut down on steps that are costly, but are not expected to influence the resulting local optimum in a decisive way. Since for now we treat the location problem at the cluster level (performed by optimBary) as very general, allowing a wide range of metric spaces \((\mathcal {X},d)\), we focus here on saving computations in the functions optimPerm, optimDelete, and optimAdd.

We have observed that by far most of the additions and deletions of points take place in the first two iterations of the original algorithm (see also Fig. 3). Checking for the addition of points in particular is costly and very rarely successful after the first few iterations. Therefore, in the improved algorithm we limit such checks to the first \(N_{\mathrm {del/add}} = 5\) iteration steps. Some further heuristics could be applied in optimAdd, but the gain in computation time is not large and they can change the outcome significantly, which is why we decided against implementing them.

In optimPerm, we cannot avoid computing matchings. However, the auction algorithm we use allows us to solve a relaxation of the problem by stopping the \(\varepsilon \)-scaling method early. In general, the auction algorithm with \(\varepsilon \)-scaling based on a decreasing sequence \((\varepsilon _1,\ldots ,\varepsilon _l)\) returns successively improved solutions that are guaranteed to lie within \(n\varepsilon _i\) of the optimal total cost after the i-th step; see Bertsekas (1988, Proposition 1). By representing rescaled distances as integers in \(\{0,1,\ldots ,10^9\}\), an optimal matching is obtained in the l-th step if \(\varepsilon _l < 1/n\). Our improved algorithm is based on the same \(\varepsilon \)-vector as the original algorithm, which has components \(\varepsilon _i = \frac{1}{n+1} 10^{l-i}\), \(1 \le i \le l\), where l is chosen in such a way that \(10^7 \le \varepsilon _1 < 10^8\). As a first improvement, we use the subsequence \((\varepsilon _{a_{\texttt {it}}}, \varepsilon _{a_{\texttt {it}}+1}, \ldots ,\varepsilon _{b_{\texttt {it}}})\), where a and b are prespecified vectors of indices in \(\{1,2,\ldots ,l\}\). A simple choice for a and b that tends to decrease the runtime noticeably is \(a_{\texttt {it}} = 1\) and \(b_{\texttt {it}} = \min \{\texttt {it},l\}\). Pseudocode for this is presented in Algorithm 5.

In practice, we settled on a somewhat more sophisticated improvement. We choose the vectors \(a = (1,1,1,3,3,3,\ldots ,3,4)\) and \(b = (1,2,3,4,6,8,\ldots ,2\lfloor \frac{l-1}{2}\rfloor ,l)\), and we use the sequence \((\varepsilon _{a_{j}}, \varepsilon _{a_{j}+1}, \ldots ,\varepsilon _{b_{j}})\), where \(j=\texttt {it}\) for \(\texttt {it} \in \{1,2,3\}\), and then j is increased by 1 each time the algorithm would otherwise converge or if the cost increases (which can only happen as long as the matchings are not optimal).

This strategy was chosen after analyzing the computations of the algorithm with respect to the time each of them takes. In the first two to three iterations, the positions of the barycenter points change a lot. Especially in the first iteration, many points are deleted and added, which completely changes the assignments. We therefore have to begin the assignment computation with \(\varepsilon _1\), and to obtain more sensible results we become more precise with each of the first three iterations. After three iterations, there are usually no big changes to the barycenter anymore, so we can reuse the assignment from the previous iteration as a sensible starting solution and can omit \(\varepsilon _1\) and \(\varepsilon _2\) in return. Leaving out the first entries of \(\varepsilon \) too soon increases the runtime. Every time the algorithm converges but has no guaranteed optimal assignment (i.e., \(b_j < l\)), j is increased by 1, meaning that the next two entries of \(\varepsilon \) are used as well, until the end of \(\varepsilon \) is reached. From then on, we can safely leave out the first three entries of \(\varepsilon \) without increasing the runtime, because at this point the assignments change only very little from one iteration to the next.

Algorithm 5: pseudocode for the improved kMeansBary algorithm
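A sketch of the \(\varepsilon \)-schedule itself: the choice of l below is one way to satisfy \(10^7 \le \varepsilon _1 < 10^8\), and the simple subsequence corresponds to \(a_{\texttt {it}} = 1\), \(b_{\texttt {it}} = \min \{\texttt {it},l\}\); function names are ours.

```r
# Sketch of the epsilon-scaling schedule: eps_i = 10^(l-i)/(n+1) with l chosen
# such that 1e7 <= eps_1 < 1e8 (one way to achieve this), and the simple
# subsequence a_it = 1, b_it = min(it, l) of the improved algorithm.
make_eps <- function(n) {
  l <- 1 + ceiling(7 + log10(n + 1))       # ensures 1e7 <= eps_1 < 1e8
  10^(l - seq_len(l)) / (n + 1)
}
eps_subseq <- function(eps, it) eps[1:min(it, length(eps))]  # a_it = 1, b_it = min(it, l)

eps <- make_eps(n = 100)
eps[1]                  # first scale, between 1e7 and 1e8
tail(eps, 1) < 1 / 100  # eps_l < 1/n guarantees an optimal assignment
eps_subseq(eps, it = 2)
```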
Fig. 3 Stepwise evolution of the barycenter for \(k=80\), \(m_{\#} = 100\). In the first iteration, 32 points are deleted and 25 added. After that, only movements take place

4.3 Practical aspects

As it turns out, the upper bound \(\tilde{n}\) on the cardinality of the barycenter from Theorem 4 is often far too large in practice. For efficiency reasons, we typically run the algorithm with a number \(n \ge \max \{n_j;\, 1 \le j \le k\}\) that is much smaller than \(\tilde{n}\). We generate a starting point pattern by picking \(\frac{1}{k} \sum _{j=1}^k |\xi _j |\) points uniformly at random from the underlying observation window. In a first step, all point patterns are filled up to n points by adding points at \(\aleph \). Then, Algorithm 1 or 5 is run.

Figure 3 shows a typical run of Algorithm 1 in the case of an i.i.d. sample \(\xi _1, \ldots , \xi _k\) of point patterns in \(\mathbb {R}^2\) generated from a distribution similar to the one studied in Sect. 5. We use Euclidean distance and \(p=2\). The current barycenter is marked by blue points. Typically, the random starting point pattern is not a good approximation to the resulting barycenter. Therefore, many points are deleted in the first iteration. Many others are added at or moved to more cost-efficient spots. Regardless of the starting pattern, the algorithm typically attains a reasonable-looking configuration after a single iteration. After that, hardly any points are added or deleted anymore. The algorithm mostly moves a few individual barycenter points around in each iteration.

Fig. 4 20 point patterns with 20 points each from the three different center scenarios \(N=5, 10, 15\) for \(\sigma =0.05\)

5 Simulation study

In this section, we present a simulation study for evaluating the algorithms described in Sect. 4 for point patterns \(\xi _1,\ldots ,\xi _k\) in \(\mathbb {R}^2\) using squared Euclidean cost.

Unfortunately, it is not feasible for larger data examples to compute the actual barycenter as a ground truth. To illustrate this, consider the special case where all point patterns have the same cardinality n and are contained in a subset of \(\mathbb {R}^2\) of radius C. Assume further that we know that there is a barycenter that also has cardinality n (which need not be the case). In this situation, it is easy to see that instead of solving the minimization problem (13), we only need to minimize

$$\begin{aligned} \sum _{j=1}^k \sum _{i=1}^n \Vert x_{\pi _j(i),j}-z_i \Vert ^2 \end{aligned}$$
(14)

in \(\pi _1,\ldots ,\pi _k \in S_n\) and \(z_1,\ldots ,z_n \in \mathbb {R}^2\). This is the assignment version of the problem of finding a barycenter of the discrete probability measures \(\frac{1}{n}\xi _1, \ldots , \frac{1}{n}\xi _k\) with respect to the Wasserstein metric \(W_2\). An exact algorithm for this problem can be found in Anderes et al. (2016) and has been improved tremendously in Borgwardt and Patterson (2018). Nevertheless, the computation times still increase rapidly with the problem size and reach minutes to hours for problems well below the size of our smallest examples below.

Table 1 Original algorithm. Maximum relative deviations from the minimum objective function value among ten starting solutions given in percent. Means taken over 100 instances, with 0.05- and 0.95-quantiles in parentheses. The first block of four rows corresponds to the deterministic cardinality, the second block to the high-variance cardinality
Table 2 Improved algorithm. Maximum relative deviations from the minimum objective function value of the original algorithm (both based on the same ten starting solutions) given in percent. Means over 100 instances, with 0.05- and 0.95-quantiles in parentheses. The first block of four rows corresponds to the deterministic cardinality, the second block to the high-variance cardinality
Table 3 Original algorithm. Total times in seconds for ten runs with random starting patterns. Means over 100 instances, with 0.05- and 0.95-quantiles in parentheses. The first block of four rows corresponds to the deterministic cardinality, the second block to the high-variance cardinality
Table 4 Improved algorithm. Total times in seconds for ten runs with random starting patterns. Means over 100 instances, with 0.05- and 0.95-quantiles in parentheses. The first block of four rows corresponds to the deterministic cardinality, the second block to the high-variance cardinality

Since we are not able to compare the results of our algorithm to the actual barycenter for larger examples, we assess the range of the final objective function values. In addition, we evaluate the time performance of the default algorithm and compare both objective function values and timings to the improved algorithm.

As problem instances, we created sets of k point patterns in \(\mathbb {R}^2\) having mean cardinality of \(m_{\#}\) in each pattern. The cardinalities \(n_j\), \(j \in [k]\), of the individual point patterns were generated by one of the following methods:

  (i) by setting \(n_j = m_{\#}\) (deterministic cardinality),

  (ii) by sampling \(n_j\) from a binomial distribution with mean \(m_{\#}\) and variance \(\approx 1\) (low-variance cardinality),

  (iii) by sampling \(n_j\) from a Poisson distribution with parameter \(m_{\#}\) (high-variance cardinality).

The points were distributed according to a balanced mixture of \(N \in \{5,10,15\}\) rotationally symmetric normal densities centered at fixed locations in \([0,1]^2\) and having standard deviation \(\sigma \in \{0.05,0.1,0.2\}\). Figure 4 gives examples under the three center scenarios for \(k=20\), deterministic cardinality \(n_j = m_{\#} = 20\) and \(\sigma =0.05\).
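A sketch of this data-generating mechanism; since the fixed center locations are not spelled out above, the sketch draws them once at random, which is an assumption made purely for illustration (function name and arguments are ours).

```r
# Sketch of the simulation design: k patterns whose points follow a balanced
# mixture of N isotropic normals with standard deviation sigma, centered at
# locations in [0,1]^2 that are drawn once and then kept fixed (assumption).
simulate_patterns <- function(k, m, N, sigma,
                              card = c("deterministic", "binomial", "poisson")) {
  card <- match.arg(card)
  centers <- matrix(runif(2 * N), ncol = 2)            # fixed across all k patterns
  lapply(seq_len(k), function(j) {
    n_j <- switch(card,
                  deterministic = m,
                  binomial      = { sz <- round(m^2 / (m - 1))   # mean m, variance ~ 1
                                    rbinom(1, sz, m / sz) },
                  poisson       = rpois(1, m))
    id <- sample.int(N, n_j, replace = TRUE)           # balanced mixture over the centers
    centers[id, , drop = FALSE] + matrix(rnorm(2 * n_j, sd = sigma), ncol = 2)
  })
}

set.seed(3)
pplist <- simulate_patterns(k = 20, m = 20, N = 5, sigma = 0.05, card = "poisson")
```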

Fig. 5 Mean objective function values over all instances as a function of \(\sigma \) for different N

We chose five \((k,m_{\#})\) pairs (20, 20), (20, 50), (50, 20), (50, 50) and (100, 100), which in combination with N varying in \(\{5,10,15\}\), \(\sigma \) in \(\{0.05, 0.1, 0.2\}\) and the three cardinality distributions yield a total of \(5 \times 3^3 = 135\) scenarios. We created 100 instances for each scenario.

Our algorithms from Sect. 4 were run from ten starting solutions whose cardinalities matched the mean number of data points and whose points were sampled uniformly at random from \([0,1]^2\). In a pilot experiment, this tended to give somewhat better local minima than starting from a random sample of all data points combined. The starting point patterns were independently chosen for each instance, but the same for both algorithms. In all cases, the penalty C was set to 0.1.

Tables 1, 2, 3 and 4 summarize the performance of our two algorithms. For clarity of presentation, we leave out the “middle values” \(N=10\) and \(\sigma =0.1\), as well as the low-variance cardinality distribution. The corresponding performance results lie, up to minor random fluctuations, between the values shown. The original purpose of including the low-variance cardinality case was to detect whether a slight departure from equal cardinalities would cause substantial differences in the performance. As it turned out, this was not the case.

We first consider the original algorithm presented in Sect. 4. Table 1 gives the maximum relative deviation from the minimum \(d_{\min }\) of the resulting objective function values among the ten starting solutions, i.e., \(\frac{d_{\max }-d_{\min }}{d_{\min }}\). We can see that the maximal objective function value among the ten runs rarely exceeds the minimum value by more than 5%. This percentage is somewhat higher for the deterministic and low-variance cardinalities and when the clusters in the (unmarked) superposition of the point patterns are well separated (small N and \(\sigma \)). This may well be explained by the fact that in these situations many pairs of points can typically be matched over short distances, so that wrong clustering decisions come at a higher relative cost. Figure 5 supports this by showing that the total objective function values within each problem size are lower for well-separated clusters.

Fig. 6 Barycenters for one of our simulated data sets (20 patterns with 20 points each). From left to right: Cuturi–Doucet algorithm without constraints, Cuturi–Doucet algorithm with (maximally) 20 support points and equal masses, typical result from the kMeansBary algorithm based on a single start. The areas of the disks are proportional to the masses

A further smaller experiment following up on the scenarios that exhibited the poorest performance for ten starting patterns showed that the margin of 5% increases to 8% when basing the maximum relative deviation from the minimum on 100 starting patterns.

For the improved algorithm from Sect. 4.2, we compute the maximum relative deviation of its objective function values from the minimum \(d_{\min }\) of the corresponding values of the original algorithm, i.e., \(\frac{d^{*}_{\max }-d_{\min }}{d_{\min }}\), where \(d^{*}_{\max }\) is the maximum of the objective function values of the improved algorithm. As seen in Table 2, the performance is no worse than for the original algorithm in spite of the reduced amount of computations performed.

We finally turn to the computation times. We present the total runtimes in seconds for the ten runs with different starting patterns. This corresponds to the realistic situation of selecting as (pseudo-)barycenter the solution with the smallest local minimum in ten runs. It also provides some more stability for the means and quantiles given in Tables 3 and 4.

Table 3 gives the runtimes for the original algorithm. We see that even for scenarios as large as 100 patterns with 100 points on average, individual runs take only a few seconds.

From Table 4, we see that the runtimes for the improved algorithm are considerably lower still; for some of the larger problems, they are less than half of the original runtimes (at virtually no loss with regard to the objective function value, as we have seen before). It is to be expected that this ratio becomes even smaller as the problem size is increased further.

Let us finally compare our algorithm to an algorithm that treats point patterns as empirical measures and tackles the Wasserstein-2 barycenter problem for these measures. As noted in the introduction, it is not realistic to treat even our smallest examples with exact algorithms for this problem. A selection of approximate algorithms can be found in Peyré and Cuturi (2019). See also the alternating algorithm in Borgwardt (2019), which includes a factor-2 performance guarantee. For our comparison, we choose Algorithm 2 in Cuturi and Doucet (2014), which alternates between solving transport problems and using gradient descent to calculate a discrete barycenter with a prescribed maximal number of support points. It allows us to restrict the set of weights for the support points to a closed convex set \(\varTheta \) and thus provides an approximate solution of the problem (14) if we set \(\varTheta = \{(1/n,\ldots ,1/n)\}\).

We thank Florian Heinemann for allowing us to use his R implementation with underlying C++ code of this algorithm. Figure 6 shows an example that compares the Cuturi–Doucet algorithm without constraint using the theoretically maximal number of support points (according to Anderes et al. (2016)), the Cuturi–Doucet algorithm with full constraint and our algorithm.

To evaluate how the algorithms perform on our objective function (13), we have run both the fully constrained Cuturi–Doucet algorithm and our kMeansBary algorithm (with a single starting value) on the smallest scenarios used in the simulation study. These are 900 instances of 20 patterns with exactly 20 points (deterministic cardinality) and 900 instances of 20 patterns whose cardinalities are Poisson with mean 20 (high-variance cardinality).

We report the ratio of the total TT-objective function (13) between the solution of kMeansBary and that of the Cuturi–Doucet algorithm, where again \(C = 0.1\). For the case of deterministic cardinality, the ratio was 0.729 on average, with a minimum of 0.554 and a maximum of 0.871. For the high-variance cardinality, the results are very similar, with an average of 0.732, a minimum of 0.541 and a maximum of 0.866. So on average the objective function values attained by the point patterns returned by the Cuturi–Doucet algorithm are about \(37\%\) larger than those attained by kMeansBary. This increase is reflected in the example in Fig. 6.

At the same time, the average runtime of the Cuturi–Doucet algorithm is more than twice the runtime of kMeansBary. This may well be due to the fact that the former is not particularly optimized for the constrained setting we use.

Overall, our comparison yields that the Cuturi–Doucet algorithm is not well suited for our problem, which is simply due to the fact that this algorithm was designed for a somewhat different problem. We expect similar results when comparing with other algorithms that compute (approximate) Wasserstein-2 barycenters.

6 Applications

The following analyses are all performed in R, see R Core Team (2019), with the help of the package spatstat, see Baddeley et al. (2015).

Fig. 7 Barycenters of weekly street thefts in the localidad of Kennedy in Bogotá. The cardinalities are 48, 53, 52, 52, 80 and 175, respectively

6.1 Street theft in Bogotá

We investigate a data set of person-related street thefts in Bogotá, Colombia, during the years 2012–2017. This data set is part of a much larger data set, covering a large number of types of crimes, collected by the Dirección de Investigación Criminal e Interpol (DIJIN), a department of the Colombian National Police. We acknowledge DIJIN and the General Santander National Police Academy (ECSAN) for allowing us to use these data. In particular, the cases of street theft in Bogotá consist of muggings, which involve the use of force or threat, as well as pickpocketing. They do not include theft of vehicles, breaking into cars, etc. Here, we focus on the locality of Kennedy, a roughly \(7.5\,\mathrm {km} \times 7.5\,\mathrm {km}\) administrative ward in the west of the city, because this area is considered by the police to be more dangerous, with a higher average number of crime events than the rest of Bogotá. The total number of street thefts in Kennedy for the considered period is 25840.

Since a plot of weekly numbers of crimes reveals no clear seasonal pattern and since weekly patterns (and hence their barycenters) have a convenient size for graphical interpretation, we compute yearly barycenters of the weekly patterns. Thus, we may think of a barycenter pattern as representing a “typical week” of street thefts in the corresponding year. As penalty parameter, we chose 1000 m. Since street information was not directly available to us, we chose the Euclidean distance as metric and set \(p=2\) in order to relate to our simulation results in the previous section.

Each barycenter was computed from 100 starting patterns with cardinalities regularly scattered over the integers between the 0.45 and the 0.7 quantiles of the weekly numbers of data points for the corresponding year. We chose this range somewhat asymmetrically around the median because the mean number of thefts (the theoretical number of points in the barycenter if the penalty becomes large) was typically quite a bit larger than the median, and also because our algorithm is somewhat better at deleting than at adding points.
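As an illustration, the construction of the starting patterns can be sketched in R as follows; here counts and win are hypothetical placeholders for the weekly cardinalities of one year and the observation window of Kennedy, and the subsequent calls to our barycenter algorithm are omitted:

library(spatstat)

counts <- c(48, 53, 52, 61, 47, 55)     # illustrative weekly cardinalities only
win <- owin(c(0, 7500), c(0, 7500))     # Kennedy is roughly a 7.5 km x 7.5 km ward

## cardinalities regularly scattered over the integers between the 0.45 and 0.7 quantiles
lo <- ceiling(quantile(counts, 0.45))
hi <- floor(quantile(counts, 0.70))
card <- round(seq(lo, hi, length.out = 100))

## one uniform random starting pattern per cardinality
## (the best result over all 100 runs of the algorithm would be kept)
starts <- lapply(card, runifpoint, win = win)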

Figure 7 depicts the obtained barycenters, which, except for the last pattern, have cardinalities just slightly below the average weekly numbers of muggings of 51.7, 57.6, 52.8, 54.4, 82.5 and 196.9, respectively. The barycenters for the years 2012–2015 appear largely similar. In 2016, we start to see denser structures forming along a line in the west and around a center in the south-east of Kennedy. These can actually be identified as a main street and a major intersection in the densely populated parts of Kennedy.

Fig. 8
figure 8

Barycenters of cases of assault for different districts of Valencia in winter and in summer. The numbers indicate multiplicities if there are several points at a single location. The cardinalities of the barycenters are 68, 69, 88, 30 (winter) and 103, 79, 74, 24 (summer)

6.2 Assault cases in Valencia

As a second application, we analyze cases of assault in Valencia, Spain, reported to the police in the years 2010–2017. Since the addresses of the assaults and the street network are available, we treat these data as point patterns on a graph, using the shortest-path distance and \(p=1\). We acknowledge the local police of Valencia city, together with the 112 emergency phone service, who kindly provided us the data after cleaning it and removing any personal information.

We split up the graph and analyze the four central districts of Ciutat Vella, Eixample, Extramurs and El Pla del Real separately. For this, we assigned each assault case to its district, but also added streets from other districts at the boundary in order to enable more natural shortest-path computations. The north-south and east-west extensions of the districts vary roughly between 1.6 and 3.3 km.

In the time domain, we split the assault data by year and season into seven winter patterns (data from December, January and February) and eight summer patterns (data from June, July and August), discarding for the present analysis the data from the intermediate seasons as well as from January and February 2010 and from December 2017. We then computed barycenters per main season and district, obtaining “typical” assault patterns for summer and winter in each of the four districts considered; see Fig. 8. The penalty was chosen as 800 m with respect to the shortest-path distance.
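A minimal sketch of this seasonal split in base R, assuming a data frame assaults with a column date of class Date (the column name and the assignment of December to the following winter are our own choices):

month <- as.integer(format(assaults$date, "%m"))
year  <- as.integer(format(assaults$date, "%Y"))

season <- ifelse(month %in% c(12, 1, 2), "winter",
          ifelse(month %in% c(6, 7, 8), "summer", NA))
## December is assigned to the following winter so that each winter pattern is contiguous
season_year <- ifelse(month == 12, year + 1, year)

## drop the intermediate seasons and the incomplete winters at the boundary of the study period
keep <- !is.na(season) & !(season == "winter" & season_year %in% c(2010, 2018))
patterns <- split(assaults[keep, ], interaction(season[keep], season_year[keep], drop = TRUE))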

As mentioned in Sect. 4.1 when describing the subroutine optimBary that finds cluster centers on networks, we can calculate all distances that are relevant to the algorithm beforehand. For this, we use the corresponding functionality built into the linnet objects in spatstat. On a standard laptop with a 1.6 GHz Intel i5 processor, the computation took only about four seconds for the largest data set, which is Ciutat Vella in summer with a total of 2494 vertices (1676 street crossings plus 818 data points).
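The following sketch illustrates this precomputation, assuming L is the street network of a district as a spatstat linnet object and xy is a data frame of assault coordinates; pairdist applied to a point pattern on a linear network returns shortest-path distances (the exact calls in our implementation may differ):

library(spatstat)

X <- lpp(xy, L)                          # assault cases as a point pattern on the network

## candidate center locations: all street crossings plus the data points themselves
V <- vertices(L)
XV <- lpp(superimpose(as.ppp(X), V), L)

## full matrix of shortest-path distances, computed once and reused in every iteration
D <- pairdist(XV)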

For the starting patterns in each district in summer and in winter, we chose n random points, where n ranged from 0.8 to 1.15 times the median cardinality of the data point patterns. Since our present implementation of the kmeansbary algorithm on graphs runs without optimDelete and optimAdd steps, we based each barycenter on a large sample of 500 starting patterns for each n. This resulted in an overall total of \(101500\) calls to our algorithm for the eight scenarios, which on average took 0.57 seconds each, using the precomputed distance matrices. One call in the largest setting (Ciutat Vella in summer) takes about 0.82 seconds and in the smallest (El Pla del Real in summer) about 0.08 seconds. The increase in the objective function was only up to 1% when we decreased the total number of calls to our algorithm by a factor of 20, which reduces the total computation time to well under one hour.
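A sketch of this sampling scheme, assuming the starting points are drawn uniformly on the network L and that data_cards holds the cardinalities of the data point patterns of one district and season (both names are ours):

m <- median(data_cards)
nvals <- seq(ceiling(0.8 * m), floor(1.15 * m))

## 500 uniform random starting patterns on the network for each candidate cardinality n
starts <- lapply(rep(nvals, each = 500), runiflpp, L = L)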

Due to the choice of \(p=1\), it happens quite frequently that there are several optimal centers for some of the clusters obtained after convergence of the kmeansbary algorithm. In this case, we take the average of their coordinates and project the result back onto the graph in order to obtain a somewhat more balanced result. The resulting point does not necessarily realize the same cluster cost as the original center points, but on a real street network the cost is not expected to become considerably worse. In fact, for the data considered, the results hardly differed at all.
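The following self-contained sketch shows this post-processing step; tied_centers is a matrix holding the coordinates of the tied optimal centers and seg a matrix of segment endpoints (columns x0, y0, x1, y1) of the street network, both hypothetical names:

## orthogonal projection of a point p onto the nearest of a set of line segments
project_to_segments <- function(p, seg) {
  dx <- seg[, 3] - seg[, 1]
  dy <- seg[, 4] - seg[, 2]
  len2 <- pmax(dx^2 + dy^2, .Machine$double.eps)
  ## position of the foot point along each segment, clamped to the segment
  t <- pmin(pmax(((p[1] - seg[, 1]) * dx + (p[2] - seg[, 2]) * dy) / len2, 0), 1)
  px <- seg[, 1] + t * dx
  py <- seg[, 2] + t * dy
  i <- which.min((px - p[1])^2 + (py - p[2])^2)
  c(px[i], py[i])
}

## average the tied centers and project the result back onto the network
center <- project_to_segments(colMeans(tied_centers), seg)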

Considering the barycenters in Fig. 8, it seems that there are no very clear effects of the season on the assaults. Nevertheless, we may discern a number of differences between summer and winter in the four districts, which even in this relatively small data set would be considerably harder to spot in a plot of the raw data.

In the first district (Ciutat Vella), there are substantially more assaults in summer, but their spatial distribution in the barycenter is more or less the same. In Eixample, we see a concentration of assault cases in summer in the Barrio Ruzafa in the southern half of the district, whereas cases are more or less equally spread in winter. A notable feature is the occurrence of 23 barycenter points at a single crossing in summer and 7 points at the same crossing in winter, with further points close by. This is due to the cul-de-sac visible in Fig. 8, which in reality forms a sort of backyard that makes the area an easy spot for assaults, especially in summer, when more people (especially tourists) move around this part of the city. The spot is well known to the police, and in recent years the number of assaults there has decreased due to police interventions. The barycenters clearly reflect this (former) assault hot spot.

In the district of Extramurs, both barycenters are more or less spread over the whole district, with two clusters of assaults occurring in the east and south; both clusters are much more pronounced in winter. In the district of El Pla del Real, there is some concentration in the winter months in the east and south-east. Apart from that, the only noticeable difference is that there are substantially more assaults in winter than in summer, which may well be related to the fact that this is a popular student district.

7 Discussion and outlook

In this paper, we have introduced the p-th-order TT- and RTT-metrics, which allow us to measure distances between point patterns in an intuitive way, generalizing several earlier metrics. We have investigated q-th-order barycenters with respect to the TT-metric and presented two variants of a heuristic algorithm. These variants return local minimizers of the Fréchet functional that mimic properties of the actual barycenter well and attain consistent objective function values. They are computable in a few seconds for medium-sized problems, such as 100 patterns of 100 points.

For the proof of Theorem 4, it was necessary to set \(p=q\). While such a choice may seem natural, we point out that, due to the separate interpretations of p as the order for matching points in the metric on \(\mathfrak {N}_{\mathrm {fin}}\) (higher p tends to balance out the matching distances) and of q as the order of the empirical moment in \(\mathfrak {N}_{\mathrm {fin}}\), it may well be desirable to combine different orders \(p \ne q\).

In the present paper, we have only dealt with the descriptive aspects of barycenters; our applications in Sect. 6 can thus only be seen as exploratory studies. In order to determine whether differences between group barycenters are statistically significant, we need to take the distribution of the point patterns around their barycenters into account and perform appropriate hypothesis tests.

Fortunately, the Fréchet functional (7) provides us with a natural quantification of scatter around the barycenter. For \(q=2\), it is quite common to refer to

$$\begin{aligned} \mathrm {Var}(\xi _1,\ldots ,\xi _k) = \min _{\zeta \in \mathfrak {N}_{\mathrm {fin}}} \frac{1}{k} \sum _{j=1}^{k} \tau (\xi _j,\zeta )^2 \end{aligned}$$

as the (empirical) Fréchet variance, due to Equation (8). A detailed asymptotic theory for performing analysis of variance (ANOVA) in metric spaces based on comparing Fréchet variances has recently been developed in Dubey and Müller (2019a). The application and adaptation of this theory to the point pattern space, and an investigation of the performance of our heuristic algorithm in this context, will be the subject of a future paper.
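For instance, plugging the pseudo-barycenter returned by our algorithm into this expression yields an upper bound on the empirical Fréchet variance. A minimal sketch in R, assuming a function tt_dist(xi, zeta, penalty, p) that evaluates the TT-distance (the name is ours and not a fixed interface):

## empirical Fréchet variance, approximated by plugging in a computed pseudo-barycenter zeta
frechet_var <- function(patterns, zeta, penalty, p = 2) {
  mean(vapply(patterns, function(xi) tt_dist(xi, zeta, penalty, p)^2, numeric(1)))
}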

Based on the computation of barycenters, more advanced procedures in statistics and machine learning become possible. These include barycenter-based dimension reduction techniques, such as Wasserstein dictionary learning, see Schmitz et al. (2018), and functional principal component analysis of point patterns evolving in time, see Dubey and Müller (2019b).