1 Introduction

A common task in physics consists of comparing the predictions of models, be it to available experimental data or among themselves in the absence of such data. It is also common for the models to differ only in the values of some parameters. When measurements exist, a global fit may be used to determine the preferred values of these parameters. The data, however, contain information beyond that which can be summarized by the preferred parameter set, including confidence intervals as well as tensions between different measurements. When the models are compared to each other, different goals can be achieved by mapping regions of parameter space onto regions of observable space. These goals may include finding benchmark points for future experimental study; investigating the level at which the models can be probed; or simply understanding the correlations between parameters that are implied by the full set of observables.

This general task can be described as a partitioning of parameter space based on the similarity of the model predictions, and clustering algorithms are a useful tool for this purpose as demonstrated in [1,2,3]. Generally speaking, the aim of clustering is to group points in a data set based on their similarity. Analyzing the clusters (or corresponding representative benchmark points) can then reveal the main structures and features in the data without looking at all individual data points.

A useful analogy is to think of a global fit as a clustering problem that partitions the parameter space based on experimental observations into bins in \(\chi ^2\). The “clusters” in this case would be the selected confidence intervals and plotting them helps us understand what the data implies for the parameters. For example, the constraints on one of the parameters might depend on the value of a second one. In the simplest case, this could reflect a linear correlation between the two variables. This is a useful but incomplete summary of all available information. Supplementing the fit with other types of clustering can reveal additional features, as we demonstrate in the application sections. Among other benefits, the mapping of clusters between parameter and observable space can help understand tensions between observables; clusters in parameter space can reveal correlations between parameters; and the partitioning of parameter space anticipates the impact of improved measurements.

The work of [1,2,3] focused for the most part on the identification of a set of benchmark points to be used in future studies. However, a main asset of clustering is that the interpretation of the outcome often provides new insights. In our case, this means revealing information about the interplay between parameter and observable space, and a detailed examination of patterns in the predictions for the considered observables. This is best achieved by combining an interactive selection of the method details with a comprehensive visualization of the outcome. Our work presents the tool pandemonium, whose graphical user interface allows the user to choose the clustering settings and presents the outcome in multiple displays, providing a better understanding of how the clusters partition the parameter space and why. Thus, we can think of clustering as part of an exploratory workflow where one can try different clustering setups and study differences in the resulting group assignments. If we also understand the implications of the selected settings, this comparison can reveal important aspects of the data.

When applied to global fits, the information obtained with clustering is complementary to what we learn from the fit: whereas the fit reduces a large number of observables in the context of a model to a best fit point and its associated confidence intervals in the model parameter space, our clustering approach provides a map to understand how different observables induce different constraints on the parameter space. This is most instructive if the study is restricted to a subset of observables with the most impact. These can be chosen to be the ones driving the fit away from or toward a preferred model; those which introduce tensions in the global fit; and so on. Such a subset can be preselected, if it exists, for example with the metrics developed in [4], as we do in Sect. 4. Alternatively, one can start with the complete set of observables and use the tools provided in the interface to select a subset. This reduction in the number of observables (or parameters) is necessary in order to concentrate on the most relevant features and is akin to dimension reduction via feature selection in machine learning.

Our paper is organized as follows. In Sect. 2, we define coordinates in observable space that will serve as the objects to be clustered, and associate with them distance functions. We then provide an introduction to clustering and different linkage methods. In Sect. 3, we describe the different features of the graphical interface and illustrate how they can guide the interpretation of the complex interplay between parameter and observable spaces.

In Sect. 4, we discuss the case of the so-called neutral B anomalies in the decay modes \(B\rightarrow K^{(\star )} \ell ^+\ell ^-\). In this system, multiple observables in the angular distributions and branching ratios have been compared to the predictions of the standard model (SM) and global fits have consistently suggested disagreement at the \(5\sigma \) level. Beyond the SM, the quark-level transition is described with an effective low-energy Hamiltonian in which the Wilson coefficients are treated as free parameters. We use a suitable set of fourteen measurements (a 14-dimensional observable space) and predictions that depend on two parameters (for the most part) to study the problem with clustering methods. We then select an additional set of six possible future observables and compare their predictions to Belle II projections in order to study their expected impact. This comparison can thus be used to relate the resolving power of future measurements to the neutral B anomalies.

Reference [3] previously introduced a python framework, ClusterKing, that implements clustering parameter points based on the predictions for a single binned observable and we compare our results to that work in Sect. 5. To that end, we consider the decay modes \(B \rightarrow D^{(\star )}\tau \nu \) which currently exhibit the so-called charged B anomalies. We select two kinematic distributions that were also studied in [3] in order to explicitly compare our framework to theirs. Finally, in “Appendices” we provide details relevant to the examples used in the applications.

2 Clustering setup

We are interested in the connection between the parameters of physics models and the resulting predictions for experimental observables. To this effect, each ‘data’ point is defined by fixing the set of model parameters and is represented in both parameter and observable space. Our goal is to look for patterns in observable space which are then investigated in parameter space. We begin by defining our notation.

2.1 Notation

When studying a particle physics model in the context of experimental observations, we typically start from parameter space, for example defined in terms of Wilson coefficients in effective field theory, or via masses and couplings in a supersymmetric model. For a fixed point in parameter space, we can then compute the model predictions, i.e., the theory predictions for experimental observables such as branching ratios or angular observables. We can then score how well a parameter point fits the observed values in this observable space via the \(\chi ^2\) function. In the following, we will formalize this notation to efficiently define the input to our clustering approach. Where possible, the notation will follow that introduced in [4].

Observable space: consists of a set of observables \(O_i\) that can be measured experimentally with a precision given by \(\sigma _i^{exp}\), and we denote the experimentally measured value of \(O_i\) by \(E_i\). In many cases, the experimental errors between observables will be correlated, and we describe uncertainties with the variance–covariance matrix \(\Sigma _{ij}^{exp}\).

Model parameter space: the underlying theoretical description of the observables has a set of free parameters. Fixing the values of the parameters (we can denote them as the parameter vector of model k, \(P_k\)) then gives a fully specified model. For each model k, we can calculate the predictions for the considered observables \(O_i\). We denote \(X_{ki}\) the prediction of model k for observable \(O_i\). The theoretical calculation also introduces an uncertainty, \(\sigma _i^{th}\), where for simplicity we neglect its dependence on the model parameters (i.e., we consider that \(\sigma _i^{th}\) is the same for all possible k).Footnote 1 Correlations between theoretical uncertainties of the different observables are again captured by the variance–covariance matrix, \(\Sigma _{ij}^{th}\).

\(\chi ^2\) function: The agreement between model predictions and experimental observations is typically summarized with the \(\chi ^2\) function. When no correlations are present, the \(\chi ^2\) for model k is given by

$$\begin{aligned} \chi ^2_k = \sum _i \frac{\left( E_i - X_{ki}\right) ^2}{\left( \sigma _i^{exp}\right) ^2 + \left( \sigma _i^{th}\right) ^2}, \end{aligned}$$
(1)

and including correlations by

$$\begin{aligned} \chi ^2_k = \sum _{i,j} \left[ E_i - X_{ki}\right] \left( \Sigma ^{exp} + \Sigma ^{th}\right) ^{-1}_{ij} \left[ E_j - X_{kj}\right] . \end{aligned}$$
(2)

Global fits aim to find the model k that minimizes the \(\chi ^2\), resulting in the set of parameters that best describes the combination of observables \(O_i\). We will denote this point by BF, with associated predictions \(X_{BFi}\).

A simple way of “clustering” the parameter space is to bin the model points according to the value of \(\Delta \chi ^2_k=\chi ^2_k-\chi ^2_{BF}\). This is typically how the results of global fits are reported in parameter space, by drawing contour lines for fixed levels of \(\Delta \chi ^2\) corresponding to confidence intervals.

2.2 Measuring distance between models

Before we can cluster models, we need to define how to measure the similarity (or rather dissimilarity) of two models. This dissimilarity is usually measured as a distance in a coordinate representation of the data points. To characterize points based on their predictions, we work with observable space information, i.e., the \(X_{ki}\).

One possible distance metric can be defined in analogy with the \(\chi ^2\) function:

$$\begin{aligned} d_{\chi ^2}\left( X_k, X_l\right) = \sum _{i,j} \left[ X_{ki} - X_{li}\right] \left( \Sigma ^{exp} + \Sigma ^{th}\right) ^{-1}_{ij} \left[ X_{kj} - X_{lj}\right] , \end{aligned}$$
(3)

or its square root, the Mahalanobis distance

$$\begin{aligned} d_M\left( X_k, X_l\right) = \sqrt{d_{\chi ^2}\left( X_k, X_l\right) }. \end{aligned}$$
(4)

This definition is similar, but not identical, to what was used in [3], because in our definition the covariance matrix does not depend on the model parameters, and the distance can thus be computed from a set of coordinates.
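
To make this concrete, the following minimal R sketch evaluates \(d_{\chi ^2}\) (Eq. 3) and \(d_M\) (Eq. 4) for two prediction vectors; the prediction vectors and covariance matrices are placeholder toy inputs, not values from the applications discussed below.

    # Chi^2-type distance (Eq. 3) and Mahalanobis distance (Eq. 4)
    # between the predictions of two models, for a fixed covariance matrix.
    chi2_distance <- function(Xk, Xl, Sigma) {
      diff <- Xk - Xl
      as.numeric(t(diff) %*% solve(Sigma) %*% diff)
    }

    # Toy example with three observables (placeholder numbers)
    Xk      <- c(0.30, 4.2, 0.85)
    Xl      <- c(0.25, 4.0, 0.90)
    Sig_exp <- diag(c(0.05, 0.30, 0.04)^2)  # experimental covariance
    Sig_th  <- diag(c(0.02, 0.10, 0.03)^2)  # theoretical covariance

    d_chi2 <- chi2_distance(Xk, Xl, Sig_exp + Sig_th)
    d_M    <- sqrt(d_chi2)                  # Mahalanobis distance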

However, Eq. 3 is by no means the only distance function that can be used. Here we will be interested mostly in functions that can be implemented from a coordinate representation of the information, allowing easy comparison of different types of distance measures. A set of distance functions can be obtained from the general notion of a Minkowski distance based on the p-norm,

$$\begin{aligned} d_p(x, y) = \left( \sum _{i=1}^n |x_i - y_i|^p\right) ^{1/p}, \end{aligned}$$
(5)

where \(p=2\) corresponds to the standard Euclidean norm, \(p=1\) is the Manhattan distance, and \(p\rightarrow \infty \) gives the infinity norm distance (also known as maximum or Chebyshev distance).

These different distance measures emphasize different aspects when assessing similarity, and that is why it is useful to be able to compare them. A larger p generally implies higher sensitivity to outlying variables, with \(p\rightarrow \infty \) representing the most extreme case, in which the distance measure reflects only the largest difference across the n-dimensional space. The maximum distance would thus be used to emphasize dominant observables, and the Manhattan distance would be preferred if we aim to reduce the sensitivity to them. Alternatively, we may wish to average the effects of all observables by using the Euclidean norm. In high-dimensional spaces, fractional distance metrics may be preferred [5].
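
In R, for example, the Minkowski family is available directly through the base dist() function; the short sketch below compares the different choices of p on a placeholder coordinate matrix Y (rows are models, columns are observables).

    # Pairwise distances between models for different values of p in Eq. 5.
    Y <- matrix(rnorm(5 * 3), nrow = 5)              # 5 models, 3 observables (toy data)

    d_manhattan <- dist(Y, method = "manhattan")     # p = 1
    d_euclidean <- dist(Y, method = "euclidean")     # p = 2
    d_maximum   <- dist(Y, method = "maximum")       # p -> infinity (Chebyshev)
    d_p3        <- dist(Y, method = "minkowski", p = 3)   # general p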

To use the different distance measures, we define a coordinate representation that can capture the information used in the definition of \(d_{\chi ^2}(X_k, X_l)\). When neglecting covariance, we define coordinates as

$$\begin{aligned} Y_{ki} = \frac{X_{ki} - R_i}{\sigma _i}, \end{aligned}$$
(6)

with \(R_i\) any fixed reference point (e.g., the origin, \(R_i=0\), the experimental value \(R_i=E_i\) or predictions for special points in parameter space, such as the SM), and \(\sigma _i = \sqrt{(\sigma _i^{exp})^2 + (\sigma _i^{th})^2}\) the combined uncertainty. For this simple scenario, the Euclidean distance (\(p=2\)) computed on the coordinate representation of Eq. 6 reproduces the Mahalanobis distance of Eq. 4, and its square the \(\chi ^2\) distance of Eq. 3.

In general, correlation between observables can be large and should not be neglected. Following [4], we can define a coordinate system in observable space that is aware of these correlations. Here we introduce the coordinates

$$\begin{aligned} Y_{ki} = \sum _j \frac{1}{\sqrt{\left( \Sigma ^{-1}\right) _{ii}}} \left( \Sigma ^{-1}\right) _{ij} \left( X_{kj} - R_j\right) , \end{aligned}$$
(7)

to facilitate numerical calculations, and we will refer to Eqs. 6 and 7 as pulls and pulls with correlations, respectively. Alternatively we could define the distance working with the square root of the inverse variance–covariance matrix directly to reproduce the Mahalanobis distance (see discussion in [4]).
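
As an illustration (a sketch of one possible implementation, not necessarily the one used in our interface), the two coordinate definitions could be computed in R as follows, given a prediction matrix X (rows = models, columns = observables), a reference point R and a combined covariance matrix Sigma, all of which are placeholders here.

    # Pulls without correlations (Eq. 6).
    pulls <- function(X, R, sigma) {
      sweep(sweep(X, 2, R, "-"), 2, sigma, "/")
    }

    # Pulls with correlations (Eq. 7), using the inverse covariance matrix.
    pulls_corr <- function(X, R, Sigma) {
      Sinv <- solve(Sigma)
      Z    <- sweep(X, 2, R, "-") %*% Sinv   # sum_j (Sigma^-1)_ij (X_kj - R_j)
      sweep(Z, 2, sqrt(diag(Sinv)), "/")     # divide by sqrt((Sigma^-1)_ii)
    }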

In both definitions, we have allowed for a generic reference point \(R_j\) which fixes the origin of our coordinate system. The value of \(R_j\) is irrelevant when computing distances between points and is therefore of no consequence to the clustering result as long as we assume fixed uncertainties (no model dependence of the \(\Sigma ^{-1}_{ij}\)). This is often a good approximation which is used in global fits (see e.g., [6]), but will not always hold, e.g., when the dominant uncertainty is a Poisson error and the shape of the distribution is considerably different between model points as in the problems studied in [3]. In that case, a distance can be directly computed between model predictions without introduction of a coordinate system, and thus without the need for a reference point.

When working with Eq. 7, the reference point should reflect the main point of interest, i.e., the experimental value to characterize predictions based on existing observations or the best fit point when studying a global fit. For the analysis of future observables, we might wish to compare to a special model, such as the SM or one suggested by a different fit.

The described coordinate representation and distance measures work well with the applications considered here. We note however that different (particle physics) problems might require adapting these definitions. One might also explore additional distances that have been introduced in the literature, typically designed to address specific problems and use cases, for example ones available in the philentropy R package [7].

2.3 Hierarchical clustering

The distance function and the coordinates are used to compute the distance or dissimilarity matrix. For K models that are compared based on their predictions \(X_{ki},~k=1,\ldots, K\), with distance metric m, this is the \(K\times K\) matrix \(d_m(k,l),~k,l=1,\ldots, K\). With this information, we can now use hierarchical clustering to group the models.

Hierarchical clustering starts by considering all data points as separate clusters and each step of the algorithm combines the two clusters that are most similar to each other. The process continues until all points are combined into a single cluster. In an ideal situation, the solution to the clustering problem consists of finding a number of clusters that ensures all points within a given cluster are similar to each other whereas points belonging to different clusters are not.

Different metrics have been developed to identify the best solution in terms of the number of clusters. In practice (and in the applications studied here), there may not be a clear-cut grouping of points. This is especially true for data points that follow a continuous distribution with no discernible gaps between different clusters. However, even in such cases clustering can be useful and distill information, as seen in our examples below. Section 3.3.2 describes how we can pick a meaningful number of clusters in our setting.
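
As a concrete sketch of this workflow in R (with a placeholder coordinate matrix Y and an arbitrary choice of distance, linkage and number of clusters):

    # Hierarchical clustering of the models from their coordinate matrix Y
    # (rows = models, columns = observables in pull coordinates).
    Y  <- matrix(rnorm(50 * 4), nrow = 50)   # placeholder coordinates
    D  <- dist(Y, method = "euclidean")      # K x K dissimilarities
    hc <- hclust(D, method = "ward.D2")      # build the full merge tree

    clusters <- cutree(hc, k = 5)            # cut the tree into 5 clusters
    table(clusters)                          # cluster sizes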

2.4 Linkage

While the dissimilarity between points is defined via the input to the algorithm, we also need to decide how to characterize dissimilarities between clusters with more than a single data point, the concept of linkage. This choice is crucial to the clustering outcome, which can vary considerably depending on the selection.

We start from the K n-dimensional points in the data set. A cluster \(C_j\) defines a set of data points that have been grouped by the algorithm, with \(j = 1,\ldots , \kappa \), where \(\kappa \) is the number of clusters at the current iteration. At the start \(\kappa =K\) and at the final iteration \(\kappa =1\). The preferred value of \(\kappa \) is typically evaluated with a stopping criterion.

We denote by \(d(a,b)\) the distance between two points a and b, and by \(d_{AB}\) the distance between the clusters A and B. Some of the commonly used linkage methods are defined as:

  • Single linkage: the distance between clusters is computed as the shortest distance between any two points, \(d_{AB} = min\{d(a,b): a\in A, b\in B\}\)

  • Complete linkage: the distance between clusters is computed as the longest distance between any two points, \(d_{AB} = max\{d(a,b): a\in A, b\in B\}\)

  • Average linkage: in this method \(d_{AB}\) is the average of all distances \(d(a,b), a\in A, b\in B\). We will refer to this definition as “unweighted” average linkage. This is also called unweighted pair group method with arithmetic mean in the literature, and results in a “weighting” by size when combining clusters (a larger group will dominate the distance computation).

    Alternatively there is the “weighted” average linkage (also called the McQuitty or weighted pair group method with arithmetic mean) which directly averages the distances, and thus drops the sensitivity to size when combining clusters. This means that when A and B have been combined into a cluster C and we compute the distance of D from C, it will be the average distance in the sense that \(d_{DC} = \frac{d_{DA} + d_{DB}}{2}\).

  • Ward linkage: Ward [8] proposed to define clusters by minimizing an objective function, typically taken to be the within-cluster dissimilarity. This means at each step we combine the two clusters which result in the smallest increase of this within-cluster dissimilarity. In R, two versions are available: ward.D2 linkage squares the dissimilarities before updating, whereas ward.D linkage does not.

Without a specific objective in mind, Ward linkage is typically preferred, because it leads to balanced clustering results. On the other hand, single or complete linkage are most useful in specific settings, with the former typically producing large, spread-out clusters and the latter producing tight clusters that can be close together. In the case of continuously distributed data, single linkage will sequentially merge all points into one large cluster and not produce any new insights, while complete linkage can give interesting results that emphasize large differences between clusters. Within this context, we find that average linkage will often result in clustering outcomes similar to those of Ward linkage, but with a simpler interpretation of cluster distance.
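
The sensitivity of the outcome to the linkage choice can be explored by rerunning the algorithm with different methods on the same dissimilarity matrix, for example as in the following sketch with placeholder data:

    # Compare cluster assignments obtained with different linkage methods.
    Y <- matrix(rnorm(100 * 4), nrow = 100)  # placeholder coordinates
    D <- dist(Y)                             # Euclidean by default

    methods <- c("single", "complete", "average", "mcquitty", "ward.D2")
    assignments <- sapply(methods,
                          function(m) cutree(hclust(D, method = m), k = 4))

    # Cross-tabulate, e.g., average vs. Ward.D2 linkage
    table(average = assignments[, "average"], ward = assignments[, "ward.D2"])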

3 Interpretation of clustering outcome

With our focus on clustering as an exploratory tool, the main outcome of the analysis is the interpretation of the results and the insights this provides into the interplay between the parameter and observable space. Additionally, we select benchmark points that are representative of the clusters and can be used in future analyses.

The interpretation of the clustering outcome is primarily done visually, in both parameter and observable space. This information is linked via a consistent color mapping to show the cluster assignment across views. Note that some visualizations discussed here would also be useful without any clustering information, but the grouping provided by the outcome simplifies comparison across the two spaces.

By using a set of interpretation diagnostics in an interactive interface, we can explore the outcome with different cluster settings in detail. The interactive setting provides important benefits: the results can change drastically depending on the choices made, and we learn about the result by exploring different combinations and by directly comparing different settings.

3.1 Parameter space

Our primary interest is often to obtain information from the observables about the model parameters. Clustering complements the \(\chi ^2\) analysis of a global fit, providing alternative partitioning of the parameter space based on grouping in the observable space. As discussed above, different settings emphasize different aspects of the information.

For comparison to the clustering results, we also present a binning in \(\Delta \chi ^2\) divided into the requested number of clusters. As it is customary to report these intervals in units of Gaussian standard deviations, we first convert the \(\Delta \chi ^2_k\) to their corresponding values in \(\sigma \) under the assumption that \(\Delta \chi ^2\) follows a \(\chi ^2\) distribution with a number of degrees of freedom given by the number of free parameters in the model. We then split the full range in \(\sigma \) into \(\kappa \) (the number of clusters) equidistant bins. To simplify the handling of low-interest regions corresponding to more than \(5~\sigma \), we include everything above \(5~\sigma \) in the last bin.
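
For concreteness, this conversion and binning could be implemented as in the following R sketch, where dchi2 holds the \(\Delta \chi ^2_k\) values, npar is the number of free parameters and kappa the requested number of bins (all placeholder values):

    # Convert Delta chi^2 values to Gaussian sigma and bin them into kappa bins.
    dchi2 <- c(0.3, 1.2, 4.8, 9.1, 18.7)   # placeholder Delta chi^2 values
    npar  <- 2                             # degrees of freedom
    kappa <- 5                             # requested number of bins

    # Two-sided Gaussian equivalent; e.g. 18.66 with 2 dof gives 3.92 (cf. Sect. 4.1)
    nsig <- qnorm((1 + pchisq(dchi2, df = npar)) / 2)
    nsig <- pmin(nsig, 5)                  # everything above 5 sigma goes in the last bin
    bins <- cut(nsig, breaks = seq(0, max(nsig), length.out = kappa + 1),
                include.lowest = TRUE)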

Problems with only two model parameters allow us to show the clustering outcome directly in the parameter plane. This can be augmented by highlighting points of special interest, for example the benchmark points selected for each cluster, the overall best fit point or selected models (such as the SM). This graph can then be used to understand how each region in parameter space (or the corresponding benchmark point) maps onto the different patterns of predictions in observable space.

Visualizing parameter space becomes substantially more complicated when the model contains more than two parameters. Existing methods, like profiling or marginalizing, project all data points into a selected 2D plane, summarizing the information across the orthogonal space. To understand how the clusters split up the full parameter space, conditional information is more appropriate (i.e., fixing the values of the parameters in the orthogonal space). The simple solution adopted here is to display the clustering information in a plane for selected values of the parameters not shown. With a regular grid, this can be accomplished by showing a sequence of conditional plots (as was done in [3]). More generally, we can use the new slice tour method [9] to resolve the cluster assignment across higher-dimensional spaces, but this option is beyond the scope of this work.

3.2 Observable space

The clustering is performed in observable space, which thus defines the “data” space. This space is typically high-dimensional; a global fit, for example, usually includes a large number of observables. To understand the outcome, we need approaches to inspect the high-dimensional information patterns. In our applications, we use parallel coordinate plots, tour displays and nonlinear dimension reduction, which is augmented with quantitative information (numerical summaries). Clustering-specific visualizations, which were used during the interactive exploration, are described in “Appendix.”

3.2.1 High-dimensional visualization

With any type of high-dimensional visualization, we have to consider the different scales across the coordinate space.Footnote 2 While scaling each variable separately is typically recommended, this is not appropriate here. As seen in the description of the coordinate systems, all variables are already scaled based on physical information (the total uncertainties, potentially taking correlations into account). The spread along each variable is thus determined by the variance in predictions for that observable, obtained across the grid, normalized to the total uncertainty. Any remaining differences in scale are thus physically meaningful. They will enter the calculation of dissimilarity and thus should be represented in the plots.

We do, however, center each variable independently: a constant offset will not influence the clustering but can be distracting in the visualizations. This means that we cannot directly read off the agreement with the experimentally observed value when using pull coordinates.

Parallel coordinate plot: a static visualization of the data points in all n coordinates is achieved by mapping each variable to a parallel line. For each data point, we draw a line connecting its values along those parallel coordinates. This display gives a good overview of patterns in the data, for example to detect correlations between the coordinates, or to understand differences between groups. We use the R package GGally [10] to generate this graph, and we map the cluster assignment to color and highlight the cluster benchmark points. The display is, however, limited to bivariate relations and simple patterns.
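
A plot of this type, colored by cluster assignment, could for instance be produced with the ggparcoord function of GGally; the data frame below is a placeholder stand-in for the pull coordinates and cluster labels.

    library(GGally)

    # Placeholder coordinates (models x observables) and cluster labels.
    Y  <- matrix(rnorm(100 * 6), nrow = 100,
                 dimnames = list(NULL, paste0("O", 1:6)))
    df <- data.frame(Y, cluster = factor(cutree(hclust(dist(Y), "ward.D2"), k = 3)))

    # One parallel line per observable, no per-variable rescaling.
    ggparcoord(df, columns = 1:6, groupColumn = "cluster",
               scale = "globalminmax", alphaLines = 0.5)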

Grand tour display: using an animated sequence of two-dimensional projections, we can show the distribution of the data across the full n-dimensional space [11, 12]. The tour shows randomly selected projections that are smoothly interpolated to provide the viewer with a continuous rotation of the high-dimensional distribution [13]. Watching the animation shows the data from all possible viewing angles, and the viewer can extrapolate from the observed low-dimensional shapes and patterns to the high-dimensional distribution. For example, we can use tours to understand grouping, identify multivariate outliers or to understand (nonlinear) correlations between the variables. We have previously used the grand tour to gain insights into the grouping of particle physics observables and to identify outliers in high-dimensional space in [14]. Tour visualizations could also be augmented with information from the dendrogram, as illustrated in [15], but this is beyond the scope of this work.
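
Such animations can be generated, for example, with the tourr R package (our choice here purely for illustration; we do not claim it is the implementation behind the displays referenced above):

    library(tourr)

    Y  <- matrix(rnorm(200 * 5), nrow = 200)          # placeholder 5D coordinates
    cl <- cutree(hclust(dist(Y), "ward.D2"), k = 3)   # placeholder cluster labels

    # Animate a grand tour of 2D projections, coloring points by cluster.
    animate_xy(Y, tour_path = grand_tour(d = 2), col = cl)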

Dimension reduction: instead of showing the full n-dimensional space, we can map the high-dimensional distribution onto a two-dimensional plane for plotting using nonlinear dimension reduction techniques. These are typically machine learning methods that aim to find the mapping for which the inter-point distance in the low-dimensional space is as similar as possible to the distance in the n-dimensional space. This is often useful for the visualization of clustering, but the results from these methods are not easily interpretable in terms of the original variables, and it can be helpful to compare the results to those in a tour [16]. Here we consider three methods: t-SNE [17], UMAP [18] and locally linear embedding [19].
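
As an illustration only (these packages are one possible choice, not necessarily what underlies our interface), 2D embeddings can be obtained in R with the Rtsne and umap packages:

    library(Rtsne)
    library(umap)

    Y <- matrix(rnorm(300 * 8), nrow = 300)   # placeholder 8D coordinates

    emb_tsne <- Rtsne(Y, dims = 2, perplexity = 30)$Y   # t-SNE embedding
    emb_umap <- umap(Y)$layout                          # UMAP embedding

    # Color the embedding by a (placeholder) cluster assignment.
    plot(emb_tsne, col = cutree(hclust(dist(Y), "ward.D2"), k = 3),
         xlab = "t-SNE 1", ylab = "t-SNE 2")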

3.3 Summary information

Ideally clusters should be tight (points within a cluster are very similar to each other) and well separated (the individual clusters are very different from each other). This is often best evaluated visually, as described above. In addition, we also use numeric summaries to characterize the clustering outcome. There are two types of summaries that we consider.

First, we can define statistics that characterize the cluster size and separation. This includes simple measures like the radius and diameter of a cluster, and also comparisons of within- to between-cluster dissimilarities that are often used to decide the preferred number of clusters. For our applications, measures with a physical interpretation are most relevant; measures used in the statistics literature are described in “Appendix.”

A different type of summary is to characterize each cluster by one representative benchmark point. This is the parameter combination that best captures the observable space predictions of all points in the cluster. The benchmark can be used for cluster comparisons, and as a representative in further analyses.

3.3.1 Cluster benchmark points

The benchmark point \(c_j\) for a cluster \(C_j\) is selected as the point which minimizes

$$\begin{aligned} f\left( c, C_j\right) = \sum _{x_i \in C_j} d\left( c, x_i\right) ^2. \end{aligned}$$
(8)

We can then define the radius of the cluster \(C_j\) as

$$\begin{aligned} r_j = \max _{x_i \in C_j} d\left( c_j, x_i\right) . \end{aligned}$$
(9)

We can think of the radius as a measure of accuracy with which the benchmark point represents the full cluster. The interpretation will depend on the coordinate and distance definition. For example, when working with coordinates that mimic a \(\chi ^2\) function (in combination with Euclidean distance), we can relate the cluster radius to a confidence interval with the appropriate number of degrees of freedom.

We may also wish to characterize the similarity of the full cluster, irrespective of the benchmark point. For this case we can define the cluster diameter, which is given as the largest distance between any two points in \(C_j\).
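
Given a distance matrix and a cluster assignment, the benchmark point, radius and diameter can be computed directly; a minimal sketch with placeholder inputs:

    # Benchmark (Eq. 8), radius (Eq. 9) and diameter for each cluster.
    Y        <- matrix(rnorm(100 * 4), nrow = 100)    # placeholder coordinates
    D        <- as.matrix(dist(Y))                    # full distance matrix
    clusters <- cutree(hclust(as.dist(D), "average"), k = 5)

    cluster_summary <- function(j) {
      idx <- which(clusters == j)
      Dj  <- D[idx, idx, drop = FALSE]
      bm  <- idx[which.min(rowSums(Dj^2))]  # point minimizing the sum of squared distances
      list(benchmark = bm,
           radius    = max(D[bm, idx]),     # largest distance from the benchmark
           diameter  = max(Dj))             # largest distance between any two points
    }

    summaries <- lapply(sort(unique(clusters)), cluster_summary)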

3.3.2 Stopping criterion

In our applications, the coordinates are continuous functions of parameters that can be measured with a certain uncertainty. This suggests a different take on the number of clusters than using the typical cluster statistics described in “Appendix.” Instead, we want to choose clusters such that points that are grouped together are experimentally indistinguishable at some level. This can be quantified with the maximum cluster radius statistic. For example, we may use a stopping criterion based on the radius such that the benchmark point is representative of the cluster up to a selected level in \(\sigma \), or based on the diameter to ensure a selected level of similarity between all points that are grouped together. This can be complemented by also checking the distances between benchmark points: a good solution should have distinguishable benchmark points that are representative of their cluster.

Recall that the overall range in distances between data points (determining the radius) depends on both the anticipated measurement uncertainty and the parameter range that we choose to explore. The latter, as well as the sampling grid spacing, must therefore be chosen judiciously, incorporating any existing information, to obtain a meaningful clustering result.
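
A possible implementation of this criterion is a scan over the number of clusters that records the maximum radius and keeps the smallest \(\kappa \) for which it drops below a chosen threshold (here 1.5, which for two degrees of freedom corresponds to the \(1\sigma \) level used in Sect. 4.2); data and threshold are placeholders, and the minimum benchmark distance can be monitored analogously.

    # Scan the number of clusters and record the maximum cluster radius.
    Y  <- matrix(rnorm(150 * 4), nrow = 150)   # placeholder coordinates
    D  <- as.matrix(dist(Y))
    hc <- hclust(as.dist(D), method = "average")

    max_radius <- sapply(2:10, function(kappa) {
      cl <- cutree(hc, k = kappa)
      max(sapply(unique(cl), function(j) {
        idx <- which(cl == j)
        bm  <- idx[which.min(rowSums(D[idx, idx, drop = FALSE]^2))]
        max(D[bm, idx])
      }))
    })

    # Smallest number of clusters with all radii below the chosen threshold
    kappa_star <- (2:10)[which(max_radius < 1.5)[1]]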

4 Application: effective Hamiltonian for \(B\rightarrow K^{(\star )}\ell ^+\ell ^-\)

The effective low energy Hamiltonian responsible for the quark-level transition \(b\rightarrow s\ell ^+\ell ^-\) is written in terms of Wilson coefficients and four-fermion operators in the general form (see for example [20]):

$$\begin{aligned} {{\mathcal {H}}}_\mathrm{eff} = -\frac{4G_F}{\sqrt{2}}V_{tb}V^\star _{ts}\sum _{i,\ell =\mu ,e}C_{i\ell }(\mu )\mathcal{O}_{i\ell }(\mu ), \end{aligned}$$
(10)

where \(C_{i\ell }\) denote Wilson coefficients, \({{\mathcal {O}}}_{i\ell }\) four-fermion operators and \(\ell \) can be electrons or muons.

Multiple studies over the last decade have suggested that the SM predictions for the Wilson coefficients \(C_{i\ell }\) might not be in agreement with experiment and that the discrepancies can be accommodated by allowing some form of new physics parameterized into these coefficients (see for example [21] and references therein). We consider four operators in this paper (only the two unprimed ones for most of the study):

$$\begin{aligned} {{\mathcal {O}}}_{9\ell } = \frac{e^2}{16\pi ^2}(\bar{s}\gamma _{\mu }P_Lb)(\bar{\ell }\gamma ^\mu \ell ),\quad {{\mathcal {O}}}_{9^\prime \ell } = \frac{e^2}{16\pi ^2}(\bar{s}\gamma _{\mu }P_Rb)(\bar{\ell }\gamma ^\mu \ell ), \nonumber \\ {{\mathcal {O}}}_{10\ell } = \frac{e^2}{16\pi ^2}(\bar{s}\gamma _{\mu }P_Lb)(\bar{\ell }\gamma ^\mu \gamma _5\ell ),\quad {{\mathcal {O}}}_{10^\prime \ell } = \frac{e^2}{16\pi ^2}(\bar{s}\gamma _{\mu }P_Rb)(\bar{\ell }\gamma ^\mu \gamma _5\ell ). \end{aligned}$$
(11)

Within the standard model the values of the corresponding Wilson coefficients are \(C^\text {SM}_{9,10}=4.07,-4.31\) and \(C^\text {SM}_{9^\prime ,10^\prime }=0\) for both muons and electrons. New physics is parametrized in a model-independent way as deviations in these coefficients from their SM values, \(C_{i\ell } \equiv C_i^\text {SM}+C^\text {NP}_{i\ell }\) (\(i=9^{(')},10^{(')})\). For simplicity we ignore other possible coefficients, assume that any new physics affects only the muons, and take the parameters to be real, ignoring the possibility of CP violation. Notationally, the parameters used in the clustering exercise are the deviations from the SM, the \(C^\text {NP}_{i}\), at a scale \(\mu =4.8\) GeV.

Equation 10 describes a multitude of experimental observables in several decay modes of the B meson in terms of the Wilson coefficients. These coefficients are thus the parameter set of our clustering study. Global fits have been used to extract their preferred values from all existing measurements. For example, the study in [21] considers 175 observables to fit six of the Wilson coefficients. These fits have resulted in values that differ from the SM expectation by around five standard deviations and have therefore attracted much attention in the literature. In this section we study this system as a clustering exercise, and will need to reduce its dimensionality. Recall that within the clustering analysis we are interested in the detailed relations between parameter and observable space. We therefore begin by deliberately reducing the dimensionality of these spaces based on prior knowledge, selecting the parameters and observables of primary interest; later on we explore the impact of additional parameters in Sect. 4.5 and of additional observables in Sect. 4.6. To this end we consider only two parameters, namely \(C_9\) and \(C_{10}\) (except for Sect. 4.5 where we add two more), because these have been singled out as the most important ones by the global fit studies. Similarly we reduce the number of observables to a set of fourteen, listed in Table 1, so that we work with a two-dimensional parameter space and a fourteen-dimensional observable space.

Twelve of the selected observables relate to aspects of the angular distribution in \({\bar{B}}\rightarrow {\bar{K}}^\star (\rightarrow {\bar{K}} \pi ) \ell ^+\ell ^-\): binned values of \(P_2\) and \(P_5^\prime \) defined as in Eqs. 259, 260 of [22]. The other two will be the ratios \(R_K\) and \(R_{K^\star }\) defined as

$$\begin{aligned} R_{K^{(\star )}}= \frac{BR\left( B\rightarrow K^{(\star )} \mu ^+\mu ^-\right) }{BR\left( B\rightarrow K^{(\star )}e^+e^-\right) }. \end{aligned}$$
(12)

The selection relies on ranking the observables by their importance in constraining the different directions in parameter space as measured by the metrics we introduced in [4].Footnote 3 Of this set:

  • \(P_5^\prime \) has been signaled by multiple studies as a major contributor to the discrepancy between the SM and the global fits, in particular the bins labeled by ID 4, 5 in Table 1. They are mostly important for determining \(C_9\) and their ranking showed little sensitivity to correlation effects.

  • \(P_2\) is singled out by the pull and residual analysis, with the bins with ID 10, 11 in Table 1 being important for determining \(C_9\) (especially when correlations are included).

  • For both \(P_5^\prime \) and \(P_2\) we use all the existing \(q^2\) bins that have been measured, instead of just those that deviate from the SM, to get a more complete picture of the distribution.

  • \(R_K\) (ID 13) and \(R_{K^\star }\) (ID 14) directly test lepton universality and are most important in determining \(C_{10}\). The rankings with or without correlations single out \(R_K\) as being important, whereas \(R_{K^\star }\) ranks high only when correlations are included. The importance of \(R_K\) according to these metrics increased after the 2019 LHCb update [23]. As discussed in [24], the new measurement means that \(R_K\) is now completely dominant for the determination of \(C_{10}\) and that becomes apparent in this study. The same arguments have shown their importance in other studies [25].

This set will permit us to study the tension observed in the global fits between \(R_K\), certain bins of \(P_5^\prime \), and, to some extent, certain bins of \(P_2\).

To generate theoretical predictions for the observables, as well as to obtain the averages of existing experimental measurements, we rely on the package flavio [26]. We first select the region of parameter space to be studied, assuming that our interest is to explore the space extending between the SM, \((C_9,C_{10})=(0,0)\) (which is the expected result), and the two-dimensional best fit of [27], \((C_9,C_{10})=(-1.08,0.33)\) (which is in reasonable agreement with other global fits).

We sample this parameter space on a \(28\times 12\) grid corresponding to equally spaced (by 0.05) values of \(C_{9}\) between \([-1.2,0.15]\) and \(C_{10}\) between \([-0.1,0.45]\) to produce a set of 336 model points. The coarseness of the grid reflects a compromise between desired resolution and response time in the interactive tool.Footnote 4

For each model point we obtain a prediction, along with a corresponding theoretical uncertainty, \(X_{ki}\pm \sigma _{ki}\) (\(k=1,2,\ldots , 336,~i=1,2,\ldots ,14\)) in the notation of the previous section. We then combine this information with the experimental values \(E_i\pm \sigma _i^{exp}\). To include correlations we use the inverse covariance matrix \(\Sigma ^{-1}\), which is also calculated by flavio and which includes the known experimental correlations as well as the theoretical correlations at the SM point. With this information we obtain coordinates \(Y_{ki}\) as defined in Eq. 6 or Eq. 7 with \(R_i=E_i\).

4.1 Comparison with \(\chi ^2\) and global fit results

For a first overview we start with a comparison of the results of clustering to those from the global fit. For this set of observables one finds a BF point \((C_9,C_{10})=(-0.81,0.12)\) that lies at \(\Delta \chi ^2\approx 16.6\), or about \(3.7 \sigma \), from the SM. An evaluation of the \(\chi ^2\) function on the grid finds the minimum at the point \((C_9,C_{10})=(-0.8,0.1)\), and this point and the SM, \((C_9,C_{10})=(0,0)\), are highlighted in the figures below as an asterisk and an open circle, respectively.Footnote 5

The \(\chi ^2\) function can be thought of as a one-dimensional response to the input in observable space that can be used to cluster the points and can be mapped onto parameter space. The maximum \(\Delta \chi ^2\) for points on our grid is 18.66 which, for two degrees of freedom, corresponds to \(3.92\sigma \). We use this to partition the space into \(\kappa \) clusters by placing boundaries at \(3.92\sigma /\kappa \) intervals.

We begin with a comparison between the \(\chi ^2\) based classification and one possible outcome of hierarchical clustering in Figs. 1 and 2. For this first comparison we choose as coordinates the pulls from experiment with correlations included, Eq. 7 with \(R_i=E_i\), and Ward.D2 linkage with Euclidean distance for \(\kappa =5\). The regions classified by \(\chi ^2\), shown on the left panel, then lie at up to \(0.78\sigma ,~1.57\sigma ,~2.35\sigma ,~3.13\sigma ,\) and \(3.92\sigma \) from the BF. Recall that here we are interested in equidistant bins in \(\sigma \), where the number of bins is chosen to match the number of clusters investigated in the following. Therefore, the resulting ellipses do not match the conventionally used \(\sigma \) intervals.

The number of clusters used, \(\kappa =5\), gives a value of the adjusted Rand index (ARI) near 0.3 when the clustering is compared to the \(\chi ^2\)-based classification. The largest ARI for these clustering parameters, about 0.5, occurs for \(\kappa =2\) (i.e. the clustering result is most similar to the binning in \(\chi ^2\) with the minimum number of clusters).
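
The ARI between the hierarchical clustering assignment and the \(\chi ^2\)-based binning can be computed, for example, with the adjustedRandIndex function of the mclust R package (one possible choice; the two labelings below are random placeholders):

    library(mclust)

    # Placeholder labelings for the 336 grid points: clusters vs. chi^2 bins
    clusters  <- sample(1:5, 336, replace = TRUE)
    chi2_bins <- sample(1:5, 336, replace = TRUE)

    adjustedRandIndex(clusters, chi2_bins)   # 1 = identical partitions, ~0 = unrelated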

Fig. 1 The results of classifying the model points into five regions with boundaries at \(0.78\sigma ,~1.57\sigma ,~2.35\sigma ,~3.13\sigma ,\) and \(3.92\sigma \) from the BF

The clustering result shown in Fig. 2 (left) already illustrates that this procedure contains complementary information to the \(\chi ^2\) classification. Although the clusters roughly have a similar shape, the \(\chi ^2\) considers both sides of the BF to be equivalent. This is why the ARI index is maximized for only two clusters in this case. The parallel coordinate display, shown in the right panel of Fig. 2, can be used to understand the difference between the two classifications. The light green and purple clusters both have large overlap with the \(0.78\sigma \) region in Fig. 1, but the values of \(R_K\) and \(R_{K^\star }\) (coordinates \(O_{13}\) and \(O_{14}\) on the plot) clearly distinguish between them. Similarly the pink and brown clusters of the right panel share partial overlaps with the \(1.57\sigma \) and \(2.35\sigma \) regions, but the parallel coordinate plot shows which observables separate them. Comparisons such as this one provide insight into distinguishing parameter points with identical \(\chi ^2\).

Note here that the clustering was a necessary step to enable us to make the connection between the parameter space (as shown on the left in Fig. 2) and the observable space represented in the right plot of Fig. 2: we use it to partition the parameter space into regions that map in a meaningful way onto the observable space, and connect the information using color in the two displays.

Fig. 2 Clustering with Ward.D2 linkage (which minimizes the variance within clusters) and Euclidean distance on pulls from experiment (left panel), and the corresponding parallel coordinates for all 14 observables (right panel) with color code matching. The darker line for each color in the parallel coordinate plot marks the cluster benchmark (also indicated on the left, with an open diamond symbol)

Interpretation of the different scales appearing in the parallel coordinate display can also be of interest. Recall that here we are comparing centered coordinate values as defined in Eq. 7; the y-axis of Fig. 2 (right) thus shows the deviation from the mean value (zero after centering), measured in the units defined by Eq. 7. Since these units include information about the uncertainties as well as their correlations, the variance along each direction can be interpreted directly as a measure of how relevant this observable might be in the clustering: a larger spread means considerable variation in predictions when taking into account all sources of uncertainty, and thus these observables can resolve different regions in parameter space. On the other hand, little variation indicates that the current uncertainties make this observable less useful in distinguishing the considered parameter points. This can be augmented with information from the tour display, in particular to better understand correlations between predicted values.

Clustering can also shed light on internal tensions in a global fit. The parallel coordinate plot without centering shown in Fig. 3 serves to make this point. In this figure, information about the central value of the experimental result for each observable is retained as the origin of the vertical axis, and is shown as a black horizontal line (all pulls vanish when evaluated at the reference point). For example, the light green cluster is closer to experiment than the purple one for many observables but not for \(O_{1,2,13}\). In particular the bins of \(P_5^\prime \) that contribute most to the discrepancy between SM and BF (\(O_{4,5}\)) pull in a different direction than \(R_K\) (\(O_{13}\)).

Fig. 3 As Fig. 2 but without centering. The black horizontal line marks the position of the reference point, in this case the central value of the averaged experimental results

For each cluster, it is possible to find a representative benchmark point (defined via Eq. 8), which can then be used for comparative studies. In Fig. 2 the positions of these benchmarks in parameter space are marked by diamonds, and they are highlighted in the parallel coordinate plot. The parameter values for each of these benchmarks can also be read off from the “benchmarks” tab on the app, which is most useful in problems with more than two parameters.

4.2 Clustering choices

In arriving at our first clustering result, we had to make a number of choices: the coordinate representation and distance metric, the linkage method and finally the number of clusters to be shown. These choices can have a large impact on the resulting clustering, and they should be informed by the underlying problem and question. In practice we suggest looking at the distribution in observable space and trying out different settings, complementing existing knowledge to make the most of the result. Here we illustrate this by discussing the number of clusters while keeping the other settings fixed.

We first use the tour display to study the distribution of points in the 14D observable space revealing that the points are distributed continuously on a 2D surface, see animation here. This can be understood as a consequence of our setup: our model points sit on a uniform grid in parameter space and all the observables are smooth continuous functions of the parameters. This means that we cannot expect clearly separated clusters and most of the cluster statistics typically used to determine a preferred number of clusters will not be effective here. However, since our coordinates are normalized to the total uncertainty, the cluster size has a physical interpretation: we can select the number of clusters that results in grouping points that are considered indistinguishable at a selected level of confidence.

Partitioning this data set necessarily involves some arbitrariness, introducing splits along the 2D surface. To illustrate how this is not necessarily a shortcoming we now consider using the average linkage method on the pull coordinates with covariance and Euclidean distance. In Fig. 4 we show a sequence of three, five and seven clusters. This sequence shows the first two partitions occurring roughly along the lines \(C_{10}-C_9\approx 1~(0.6)\), corresponding to fixed values of \(R_{K^{(\star )}}\sim 0.76~(0.86)\) respectively.Footnote 6 The variation in the predictions for \(R_K\) is shown in Fig. 5 (right), and confirms the conclusion that this split happens along the direction of constant \(R_K\) (\(O_{13}\)).

In the next step two splits occur. First, the dark green group is partitioned in two (red and blue cluster in the second row) at some fixed value of \(R_K\) as can be confirmed in the right panel, which shows the two new groups separated mostly in \(O_{13}\). Second, the purple group splits in two (now bright orange and purple) roughly at \(C_{10}\sim 0.07\). The parallel coordinate plot shows this being due mostly to certain bins of the \(P_2\) observable (\(O_{11,12}\)). Note here that while the highlighted benchmark points of the two groups do differ in \(O_{13}\), the two groups overlap in this observable.

Finally, in the last step, the bright orange cluster splits into what is now shown in brown and yellow near \(C_9\sim -0.95\) and the bright green cluster splits into what are now the light purple and pink clusters near \(C_{10}\sim 0.07\). These splits are more subtle and due to the combination of multiple observables.

Fig. 4 The results of three, five and seven clusters using Euclidean distance on the pulls with covariance and average linkage in parameter space (left panel) with the corresponding parallel coordinate plots (right panel)

Fig. 5 Variation of observables 12 (left, \(P_2\) at high \(q^2\)) and 13 (right, \(R_K\)) with the two model parameters. Note that the color scale shows the variation in terms of the centered pulls calculated with correlations included (Eq. 7), thus zero indicates the mean value, and any variation away from the mean is measured in those units

The optimal number of clusters in this case can be decided by considering their size and separation. To this end we show in the top panel of Fig. 6 the maximum cluster radius (i.e. the largest distance any point has from its representative benchmark) and the minimum benchmark distance (the smallest separation between two benchmarks) as functions of the number of clusters. These two metrics capture the similarity of points within a given cluster and the dissimilarity between different clusters, respectively. Following our discussion in Sect. 3.3.1 we interpret the cluster radius as a confidence level contour. The square of the Euclidean distance on the pulls (Footnote 7) is a good approximation to \(\Delta \chi ^2\), implying for this example that a cluster radius \(r_j\lesssim 1.5\) corresponds to a cluster \(C_j\) of points that are indistinguishable at the \(1\sigma \) level from their benchmark point \(c_j\) (Footnote 8). The left panel of Fig. 6 shows that with 5 clusters all radii are below this value. The smallest number of clusters satisfying this condition corresponds to the desired partitioning: the simplest solution in which all the points within a cluster lie at most within \(1\sigma \) of the benchmark. Similarly, the minimum benchmark distance statistic, shown in the right panel of Fig. 6, reveals that for 5 clusters any two benchmarks are separated by at least \(d(c_i,c_j)\gtrsim 1.5\), so they are distinguishable at the \(1\sigma \) level. These two statistics combined indicate a preferred solution of five clusters for this problem. Alternative criteria to determine the number of clusters can also be constructed, for example using the maximum cluster diameter, which gives the maximum separation between any two points within a given cluster. For these five clusters, this value is about 3, which for two degrees of freedom corresponds to about \(2.5\sigma \).
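
The numerical correspondence between the distance thresholds and confidence levels quoted here can be checked with a few lines of R, assuming two degrees of freedom:

    # Radius threshold corresponding to 1 sigma for two degrees of freedom
    sqrt(qchisq(2 * pnorm(1) - 1, df = 2))       # ~1.52

    # Sigma level corresponding to a cluster diameter of about 3
    qnorm((1 + pchisq(3^2, df = 2)) / 2)         # ~2.5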

A somewhat different interpretation is possible using maximum distance. In this case choosing a maximum cluster radius of one indicates that none of the predictions differ from the benchmark point by more than one unit of combined theoretical and experimental uncertainty. In the bottom panel of Fig. 6 we show the corresponding metrics for maximum distance with complete linkage on the pulls. The maximum radius statistic suggests four to six clusters whereas the minimum benchmark distance suggests five clusters. The five clusters for this choice of distance and linkage differ from the center panel of Fig. 4 mostly in that all boundaries now follow the correlation implied by \(R_K\) and \(R_{K^\star }\).

Fig. 6 Maximum cluster radius (left) and minimum benchmark distance (right) when using pull coordinates with correlations and Euclidean distance with average linkage (top); maximum distance with complete linkage (bottom)

4.3 Correlation, collective effects and dominant observables

Next we want to explore how other choices affect the results, and how we can connect this to physics insights. For simplicity here we restrict our study to results with three clusters, pull coordinates and average linkage. We vary the distance metric used and whether correlations are included in the coordinate definition.

First, we repeat the clustering shown in the upper panel of Fig. 4, but computing the coordinates without correlations. This is shown in Fig. 7 (left) and we see that the partitioning now occurs exclusively along \(C_9\) (while there is of course still some dependence on \(C_{10}\), it is much smaller than that on \(C_9\), and thus is not visible at the selected grid coarseness and number of clusters, see Fig. 29 and the associated discussion in “Appendix”). Indeed, the effect of correlations is known to be important for the global fit of \(b\rightarrow s\ell ^+\ell ^-\) observables. The parallel coordinate plot shown in Fig. 7 (right) shows that these clusters are neatly separated along the first twelve observables but mix for \(R_K\) and \(R_{K^\star }\) (\(O_{13,14}\)). Thus, the angular observables have gained importance when correlations are neglected (the reduced importance of these observables in the global fit when accounting for correlated errors was previously noted with the metrics developed in [4]).

Fig. 7 The results of three clusters using Euclidean distance on the pulls without covariance and average linkage in parameter space (left panel) with the corresponding parallel coordinate plots (right panel)

These results clearly show that the angular observables suggest a different partitioning of the parameter space compared to \(R_K\) and \(R_{K^\star }\) (\(O_{13,14}\)), and the results depend critically on having one of them dominate. These insights could also be obtained by experimenting with different linkage methods, or by dropping the dominant observables \(R_K\) and \(R_{K^\star }\) from the input.

For example, we can explore the partitioning of the parameter space when including the correlations in the coordinate definition, but using Manhattan distance, or when dropping correlations but using Maximum distance. These results are shown in Fig. 8 (left and middle). We find that Manhattan distance, which de-emphasizes dominant observables, shows results similar to those in Fig. 7 (i.e. when the importance of angular observables is exaggerated by neglecting correlations). Similarly, when placing the emphasis on dominant observables by using Maximum distance, we find that the patterns match those of Fig. 4 even when correlations are not included.

Finally, we can compare these results with the clustering obtained after explicitly removing \(R_K\) and \(R_{K^\star }\) from the dataset. This is shown in Fig. 8 (right) and produces a different pattern.Footnote 9 Because we have dropped the two observables that previously dominated the result and induced a positive correlation between the two parameters, we now observe the negative correlation induced by some of the angular observables instead. As an example we show the variation in predictions for observable 12 (highest \(q^2\) bin of \(P_2\)) in Fig. 5 (left).

Fig. 8 The results of three clusters using average linkage and pull coordinates with different distance metrics: Manhattan distance with covariance (left panel), Maximum distance without covariance (middle panel) and Euclidean distance with correlations, but dropping \(R_K\) and \(R_{K^\star }\) (right panel). For the associated parallel coordinate plots see Fig. 28 in “Appendix”

While these comparisons of course do not show exact matches, the overall patterns observed give useful hints that can aid the physics interpretation of the importance of different observables, and the types of patterns they induce in the clusters in parameter space.

4.4 Beyond fixed errors

One of the assumptions we have made up to now is that the theoretical uncertainties for all observables are independent of the model parameters. This is clearly an approximation, but one that is often used in global fits to simplify the optimization. For the observables considered here this assumption was assessed to be valid in [27] by direct comparison of the covariance matrix, but the detailed impact on the results was not studied.Footnote 10

Here we study how the computation of theory uncertainties and correlations affects the clustering outcome. The first thing to notice is that when we use model-independent errors in our coordinate definitions (Eqs. 6 and 7), the distance between two models does not depend on the reference point \(R_i\). Instead, when the errors are model dependent, the distance calculation does depend on the reference point chosen. Here we will work with pull coordinates with respect to the experimentally measured values.

Uncertainties are evaluated via sampling of the nuisance parameters and this introduces a further statistical error that differs between parameter points. Finally, because our implementation for the computation of coordinates on the fly assumes fixed input for the covariance matrix, we have to compute the coordinate representation outside the app and load the resulting values as user defined coordinates. This section thus serves as an example of how to use the interface when the desired coordinate definition is not covered by those introduced in Sect. 2.2.
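The sketch below illustrates this external workflow under stated assumptions: predictions, measurements and a model-dependent covariance for each parameter point are taken as given (for example from flavio), the pulls are built with a Cholesky decorrelation as above, and the file names and array shapes are placeholders. The resulting csv file is then loaded into the app as user defined coordinates.

```python
# Hypothetical preprocessing script for user defined coordinates. File names,
# array shapes and the Cholesky-based pull definition are illustrative
# assumptions; they are not part of pandemonium itself.
import csv
import numpy as np

def pulls(pred, meas, cov):
    # decorrelated pulls using the model-dependent covariance of this point
    L = np.linalg.cholesky(cov)
    return np.linalg.solve(L, pred - meas)

predictions = np.load("predictions.npy")     # (n_points, n_obs), one row per model point
covariances = np.load("covariances.npy")     # (n_points, n_obs, n_obs), per-point covariance
measurements = np.load("measurements.npy")   # (n_obs,), experimental central values

with open("user_coordinates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for pred, cov in zip(predictions, covariances):
        writer.writerow(pulls(pred, measurements, cov))
```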

To illustrate the results we reproduce the simple clustering used before: three and five clusters using average linkage and Euclidean distance, but now computing the coordinates using the model specific covariance matrix for each parameter point. The resulting clustering in \(C_9\) vs \(C_{10}\) is shown in Fig. 9 (left and middle), together with the corresponding parallel coordinate plot for three clusters (right, footnote 11).

Fig. 9 The results of three (left) and five (centre) clusters using Euclidean distance on the pulls with correlations and average linkage, with theoretical errors evaluated at each model point, shown in parameter space, and the corresponding parallel coordinate plot for three clusters (right panel)

At small absolute values of the Wilson coefficients (i.e. near the SM point) the result looks similar to what was found with fixed errors: it indicates a correlation of \(C_9\) and \(C_{10}\) that results in approximately fixed values of \(R_K\). This is expected, since the fixed errors had been evaluated at the SM point. However, we note that the boundaries change slightly as we move away from the SM, most notably the partitioning between the purple and brown clusters, which shifts toward larger (negative) values of \(C_9\). As the number of clusters increases, the differences with Fig. 4 become more evident. This is of course expected, as finer details become more important, and is illustrated with the five cluster partition in Fig. 9.

4.5 More than two parameters

In general there will be more than two parameters affecting the predictions. To generalize these tools to incorporate that possibility we need to visualize more than two dimensions in both parameter and observable space. One possibility to study the resulting clusters would be to use two separate tour displays with the capability to slice [9, 28]. This is beyond the scope of the present work but will be considered in a future publication.

However, we can use the present tool to visualize clusters with two dimensional slices of parameter space provided the parameter scan is on a regular grid, because this allows us to select points in a slice using an exact condition on the additional parameters. To see how this works we continue with our example of neutral B-anomalies, where the global fit studies suggest that two additional parameters, \(C_{9^\prime }\) and \(C_{10^\prime }\), may also play a non-negligible role.
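In practice the slicing amounts to a simple selection, sketched below with illustrative column and file names: because the scan is on a regular grid, the additional parameters can be fixed with an (approximate) equality test and the remaining two-dimensional slice displayed.

```python
# Sketch of slicing a gridded four-parameter scan (column and file names are
# illustrative, not the actual pandemonium inputs).
import numpy as np
import pandas as pd

grid = pd.read_csv("scan_results.csv")   # columns: C9, C10, C9p, C10p, cluster, ...

# select the C9-C10 plane at C9' = -0.1, C10' = 0; np.isclose guards against the
# floating point representation of the grid values
mask = np.isclose(grid["C9p"], -0.1) & np.isclose(grid["C10p"], 0.0)
slice_df = grid[mask]
# slice_df now holds the points shown, e.g., in the left panel of Fig. 10
```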

We reproduce the settings from Sect. 4.2, i.e. we use average linkage with Euclidean distance on the pulls and split the data into three clusters.

In Fig. 10 we illustrate the clustering in the \(C_{9}-C_{10}\) plane with two slices at \(C_{10}^\prime =0\), one with \(C_9^\prime = -0.1\) (left panel) and one with \(C_9^\prime = 0.5\) (center-left panel). The by now familiar correlation along fixed values of \(R_K\) appears, but this time the size of the clusters changes as we vary \(C_9^\prime \), revealing the influence of the additional parameters. Next we show the \(C_{9}-C_{9}^\prime \) plane using a slice with \(C_{10} = 0.2\) and \(C_{10}^\prime =0\) (center-right panel). The correlation seen in this slice can be traced primarily to \(R_K\) as well, as can be seen in the panel showing the variation in predictions for \(O_{13}\) across the \(C_{9}-C_{9}^\prime \) plane (right panel).

Fig. 10 Three cluster separation of four parameter models with average linkage and Euclidean distance of pulls with correlations. Slices on the \(C_{9}-C_{10}\) plane at \(C_9^\prime = -0.1\) and \(C_{10}^\prime =0\) (left-most) and \(C_9^\prime = 0.5\) and \(C_{10}^\prime =0\) (second left). Slice on the \(C_{9}-C_{9}^\prime \) plane at \(C_{10} = 0.2\) and \(C_{10}^\prime =0\) (second right). Variation of predictions for \(O_{13}\) across the \(C_{9}-C_{9}^\prime \) plane (right)

It is instructive to compare the clustering in observable space for the cases where the models depend on two and four parameters. With two parameters the predictions sit on a 2D manifold in 14 dimensions, whereas with four parameters they span a 4D manifold. This is easy to understand, as fourteen functions of n parameters correspond to a parametric representation of an nD manifold in 14 dimensions. Interestingly, noise from the marginalization of nuisance parameters shows up in these structures as “thickness” on top of the expected dimensionality (footnote 12). In Fig. 11 we illustrate this with two still plots (with links to animated gifs) from tours of five clusters in observable space for two and four parameters. The plot in the right panel shows a 2D display of the four parameter data in observable space. Since our example does not contain clearly separated clusters, the points also appear continuously spread out, with some artificial structure introduced by the non-linear mapping that does not reflect the clustering outcome. For example, clusters are spread across small gaps that appear in the display. In this case the information provided by the tour is more useful.

Fig. 11 Still image from the tour showing five clusters in 14D observable space for models with two parameters (left, see animation here) and four parameters (middle, see animation here). For comparison we also show the 2D view obtained with t-SNE (right), which does not resolve the clusters for this example

In this setting the \(\chi ^2\) function varies very slowly along certain directions in the \(C_{9^\prime }-C_{10^\prime }\) plane, which affects its minimization. The clustering itself is not affected, but the slow variation can result in hyper-cylindrical clusters. In Fig. 12 we show four slices in the \(C_9-C_{10}\) plane corresponding to \(C_{10^\prime }=-0.2\) and \(C_{9^\prime }=-0.1,~0.05,~0.2,~0.35\), respectively. The top row illustrates how the \(\chi ^2\) function varies slowly along this direction, while the bottom row illustrates a similar behavior for the clusters. This particular functional dependence does not affect the clustering algorithm, which groups together points with similar predictions; the clustering outcome can, however, illustrate the limited sensitivity along certain directions.

Fig. 12 Partitioning into four clusters of data with four Wilson coefficients: above, confidence level contours; below, clusters with average linkage and Euclidean distance

Fig. 13 The results of three clusters using Euclidean distance on the pulls with covariance and average linkage in parameter space (left panel) with the corresponding parallel coordinate plots (right panel) for all 89 observables

4.6 Additional observables

Up to now we have only considered a small subset of the 175 observables included in the global fit of [6]. The selection was guided by the fit results and its detailed analysis in [4]. We now explore how additional observables could provide complementary information if included in the cluster analysis. To this end we consider a much larger subset of 89 observables that is easily accessible via an implementation in flavio, and we perform a preliminary analysis to select observables for further study. From the observables entering the fit of [6] we have included in our set of 89 the LHCb measurements, except for some of the observables describing the angular distribution of \(B_s\rightarrow \Phi \mu \mu \) which are not yet defined in flavio. None of the excluded observables were found to have a major impact in our study of [4].

In Fig. 13 we show the results of three clusters with the same settings as in Fig. 4, allowing for direct comparison: average linkage with Euclidean distance on the pulls including all known correlations. The left panel displays the partition of parameter space and the right panel the associated parallel coordinates including all 89 observables, with the original 14 of Table 1 labeled \([1-14]\). The cumulative effect changes the partitioning seen in Fig. 4. The parallel coordinate plot shows that although \(R_K\) is still dominant, there are a few other important observables, which we have placed near the end of the list. The largest effect would come from number 86, \(Br(B_s\rightarrow \mu ^+\mu ^-)\), which is also known to play an important role in the global fit. For example, if we keep only the first 85 observables, the partition of parameter space shown in the right panel of Fig. 14 is very similar to that in Fig. 4 obtained with only our original subset of 14 observables, validating our choice. On the other hand, the large set of observables provides increased resolution in the partitioning of parameter space. With the same criteria used so far, the resolution increases from four clusters with the dominant 14 observables to nine clusters with all 89. For completeness we show this clustering in the parameter space in Fig. 30 in “Appendix.”

Fig. 14 Variation of the prediction for the two selected observables \(Br(B_s\rightarrow \mu ^+\mu ^-)\) (left panel) and \(P_4^\prime (B_0\rightarrow K^*\mu ^+\mu ^-)[0.1-0.98]\) (middle panel). The variation is shown in terms of the pulls calculated without correlation effects. The right panel shows the partition of parameter space when only the first 85 observables are kept

In Fig. 15 we illustrate the effect of adding \(Br(B_s\rightarrow \mu ^+\mu ^-)\) (\(O_{15}\) in this plot) to the set of observables in Table 1. The left panel shows the partitioning of parameter space into three clusters using Euclidean distance on the pulls with covariance and average linkage, and the right panel shows the corresponding parallel coordinates. The latter indicates that the importance of \(Br(B_s\rightarrow \mu ^+\mu ^-)\) is almost as large as that of \(R_K\) and that it produces a very different clustering. The left panel of Fig. 14 shows the variation of \(Br(B_s\rightarrow \mu ^+\mu ^-)\) across the parameter space, indicating that it mostly partitions the space along \(C_{10}\). It has been pointed out before that there are different treatments in the literature for the combination of measurements of this observable and that they lead to very different conclusions. With flavio we are using an average of experimental results \(Br(B_s\rightarrow \mu ^+\mu ^-) = (2.81^{+ 0.24}_{-0.22})\times 10^{-9}\), whereas [29], for example, finds \(Br(B_s\rightarrow \mu ^+\mu ^-) = (2.94 \pm 0.43)\times 10^{-9}\). The difference in these errors leads to different conclusions at present, as can be seen by comparing the top and bottom rows of Fig. 15. Note that the apparent difference in crossing between the predictions for the last two observables is a result of the different scales; the relation between the two is fixed by the theoretical expression. When looking at the parallel coordinate plot with rescaled axes (each coordinate is rescaled to have variance one before plotting; this version is always presented in the app alongside the unscaled plot), we can see that the behavior is the same in both cases.

Fig. 15 The results of three clusters using Euclidean distance on the pulls with covariance and average linkage in parameter space (left panel) with the corresponding parallel coordinate plots (right panel) for the 14 observables in Table 1 plus \(Br(B_s\rightarrow \mu ^+\mu ^-)\). The top panel corresponds to the combination of measurements from flavio and the bottom panel to the combination from [29]

In Fig. 16 we illustrate the effect of adding \(P_4^\prime (B_0\rightarrow K^*\mu ^+\mu ^-)[0.1-0.98]\) (\(O_{15}\) in this plot) to the set of observables in Table 1. With the current LHCb measurement, \(P_4^\prime (B_0\rightarrow K^*\mu ^+\mu ^-)[0.1-0.98] = 0.135\pm 0.118\) [30], the error in this observable is too large to have a significant impact on the fit. However, the central panel of Fig. 14, showing the variation of this observable across the parameter space, indicates a very interesting pattern with a parameter correlation orthogonal to that seen in \(R_K\). This indicates that if the error can be reduced by a factor of four by future measurements, this observable can provide very useful information. We illustrate this under the assumption of a reduced error in the left panel (partition of parameter space) and right panel (parallel coordinates) of Fig. 16, for three clusters using Euclidean distance on the pulls with covariance and average linkage.

Fig. 16 The results of three clusters using Euclidean distance on the pulls with covariance and average linkage in parameter space (left panel) with the corresponding parallel coordinate plots (right panel) for the 14 observables in Table 1 plus a projection for a possible future measurement of \(P_4^\prime (B_0\rightarrow K^*\mu ^+\mu ^-)[0.1-0.98]\) as described in the text

4.7 Assessing the impact of future measurements

Another application of clustering is to assess the impact of future measurements, both on clustering problems and on global fits. In this case clustering can also provide information to plan or prioritize future measurements. Different models can be compared before measurements are made to understand how different predictions map onto different regions of parameter space. This can be used, for example, to select benchmark points for detailed study.

When measurements do not exist, additional choices must be made when selecting coordinates. The reference point \(R_i\) should be set based on a preferred model, for example the SM or a best fit point obtained in a global fit to existing measurements (footnote 13). The uncertainties used to normalize the coordinates can be estimated by combining theoretical uncertainties with anticipated statistical uncertainties in future measurements. Moreover, since the latter correspond to an expected luminosity, the clustering outcome can show how the partitioning of parameter space varies with luminosity.
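A minimal sketch of one way to build such normalizing uncertainties is shown below; the quadrature combination and the \(1/\sqrt{L}\) scaling of the statistical part are standard assumptions, and the numbers are placeholders rather than the projections used in this study.

```python
# Sketch: combine theory errors with luminosity-scaled statistical projections.
import numpy as np

def projected_uncertainty(sigma_theory, sigma_stat_ref, lumi, lumi_ref):
    """Scale the reference statistical error as 1/sqrt(L), add theory in quadrature."""
    sigma_stat = sigma_stat_ref * np.sqrt(lumi_ref / lumi)
    return np.sqrt(sigma_theory**2 + sigma_stat**2)

# example: a statistical projection quoted for 50/ab, rescaled to 5/ab, grows by
# sqrt(10) ~ 3.2 (compare the "factor of 3" quoted later in Sect. 4.7)
print(projected_uncertainty(sigma_theory=0.01, sigma_stat_ref=0.02, lumi=5, lumi_ref=50))
```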

We now continue the study of processes which can be described with Eq. 10 by considering modes that have been proposed in the literature but not yet measured. Many examples of this kind are listed in [6], and the Belle collaboration has already reported some preliminary results for a few of them [31].

We use here lepton flavor violating observables that directly compare angular distributions in the decay \(B\rightarrow K^\star \mu ^+\mu ^-\) to the corresponding ones in the decay \(B\rightarrow K^\star e^+e^-\),

$$\begin{aligned} Q_2\equiv P_2^\mu -P_2^e,&Q_5\equiv P_5^{\prime \mu }-P_5^{\prime e}. \end{aligned}$$
(13)

These observables test lepton universality directly, as \(R_{K^{(*)}}\) do, and thus probe physics beyond the SM. In addition, they directly compare, between final state muons and electrons, the \(P_{2}\) and \(P_5^{\prime }\) observables that have shown deviations from the SM. We use the set of six observables listed in Table 3, which satisfy the following criteria:

  • There exist projections for the sensitivity that can be achieved by Belle II [22].

  • They were singled out as important by the pull and residual analysis of [4]. In addition, it has been suggested that \(Q_5\) can play an important role in separating different NP scenarios [32].

The theoretical predictions for these observables, along with their corresponding uncertainties, are computed for the same model points as in the previous section with the aid of flavio. We limit ourselves to the two parameter case (\(C_{9}\), \(C_{10}\)), with all theoretical uncertainties evaluated at the SM point.

We begin by clustering the pulls with the SM as a reference point, an estimated experimental covariance matrix for 50 ab\(^{-1}\) from Eq. 15, and using average linkage with Euclidean distance (footnote 14). For these choices, we show in Fig. 17 the maximum cluster radius and minimum benchmark separation as a function of the number of clusters. The same arguments as in Sect. 4.2 tell us that the optimal number of clusters for this observable set is four.
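For reference, the two diagnostics shown in Fig. 17 can be computed as in the sketch below. The benchmark of a cluster is taken here to be the member closest to the cluster mean (a medoid-like choice made for illustration; the paper's own benchmark definition may differ), the radius is the largest distance from a member to its benchmark, and the separation is the smallest distance between any two benchmarks.

```python
# Sketch of maximum cluster radius and minimum benchmark separation as a function
# of the number of clusters, using average linkage and Euclidean distance on the
# pull coordinates X (shape: n_points x n_obs). Assumes n_clusters >= 2.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def radius_and_separation(X, n_clusters):
    Z = linkage(X, method="average", metric="euclidean")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    benchmarks, max_radius = [], 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        # benchmark: member closest to the cluster mean (illustrative choice)
        bench = members[np.linalg.norm(members - members.mean(axis=0), axis=1).argmin()]
        benchmarks.append(bench)
        max_radius = max(max_radius, np.linalg.norm(members - bench, axis=1).max())
    d_bench = cdist(benchmarks, benchmarks)
    min_separation = d_bench[np.triu_indices_from(d_bench, k=1)].min()
    return max_radius, min_separation
```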

Fig. 17 Maximum cluster radius (left) and minimum benchmark separation (right) for average linkage and Euclidean distance on pulls with correlations for 50 ab\(^{-1}\) as a function of number of clusters

The corresponding clusters in parameter space are shown in Fig. 18 along with the associated parallel coordinate plot. The position of the cluster benchmarks (indicated by the diamonds) and the shape of the cluster boundaries already suggest that these observables will be mostly sensitive to \(C_9\). The parallel coordinate plot shows that all the observables considered in Table 3, with the possible exception of the lowest \(q^2\) bin for \(Q_2\) (\(O_1\)), provide a clean separation of the clusters. The variations in predictions across parameter space indicate that \(O_1\) (also shown in Fig. 18) has the best chance of providing resolving power for \(C_{10}\), but to increase its importance, the relative precision of this measurement with respect to the others must increase. \(O_6\) also shows a slight correlation between \(C_9\) and \(C_{10}\), but with a different pattern than \(O_1\).

Apart from comparing clusters, we can also use the parallel coordinate plot to understand correlations between the observables more generally. In this example we see clear positive correlation between most of the observables, and negative correlation between observables 3 and 4.

Fig. 18 Four clusters using average linkage and Euclidean distance on pulls from the SM with estimated experimental covariance matrix as in Eq. 15 for the observables in Table 3 (left) with corresponding parallel coordinates (center). Variation in predictions across the parameter space for \(O_1\) (right)

Clustering with complete linkage and maximum distance yields similar results, reinforcing the conclusion that there are no standout coordinates among this set. Variations are also minor if the correlations are not included, suggesting that they are not important at the level implied by our estimate in Eq. 15.

The projected errors for a luminosity of 5 ab\(^{-1}\) are larger by about a factor of 3. Consequently we expect the resolution to be about three times worse, indicating that the same region of parameter space shown in Fig. 18 cannot be resolved with this data.

4.8 Future measurements in the context of an existing set

Future measurements of new observables will not be considered in isolation; instead, they will be added to the existing set. We now illustrate the impact of measuring the set of observables in Table 3 on the clustering problem of Sect. 4. Of course, this is only a partial study, as a complete analysis for physics projections would also require us to estimate future improvements in the measurements of the original set of 14 observables.

In light of our previous discussion we expect several things to happen when the two sets are combined: the resolution measured by the stopping criterion should increase; the original clusters of Sect. 4 will change to reflect the dominant patterns in the complete set of observables. These changes may be significant if the new observables provide the dominant constraints. It is instructive to compare the combination using two values for the Belle II luminosity, 5 ab\(^{-1}\) and 50 ab\(^{-1}\).

In Fig. 19 we show the maximum cluster radius for average linkage and Euclidean distance on pulls with correlations as a function of the number of clusters. The covariance matrix is a block-diagonal combination of the two matrices previously used (the \(14\times 14\) matrix of Sect. 4 and the \(6\times 6\) matrix of Sect. 4.7); it ignores correlations between the original and new sets of observables. Requiring the maximum radius to be below \(\sim 1.5\) (since we still have only two parameters), the suggested number of clusters grows from five in Sect. 4 to six for the 50 ab\(^{-1}\) projections. The minimum benchmark separation in this case also increases to about 1.7.
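The combined covariance can be assembled as a block-diagonal matrix, as in the brief sketch below (file names are placeholders); the zero off-diagonal blocks are precisely the neglected correlations between the current and future observable sets.

```python
# Sketch: block-diagonal combination of the current 14x14 covariance and the
# projected 6x6 Belle II covariance (file names are illustrative).
import numpy as np
from scipy.linalg import block_diag

cov_current = np.load("cov_current_14.npy")    # 14 x 14, observables of Sect. 4
cov_future = np.load("cov_belle2_50ab.npy")    # 6 x 6, estimate as in Eq. 15
cov_combined = block_diag(cov_current, cov_future)   # 20 x 20, zero cross blocks
```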

Fig. 19 Maximum cluster radius for average linkage and Euclidean distance on pulls with correlations as a function of number of clusters for the combined observables of Tables 1 and 3. The left (right) panel shows the results for 5 ab\(^{-1}\) (50 ab\(^{-1}\)) for the future measurements

The parallel coordinates reveal that several of the new observables may reach an importance similar to \(R_K\) at 50 ab\(^{-1}\) but not at 5 ab\(^{-1}\) (footnote 15). Their combined effect is likely to dominate the clustering at the higher luminosity (but recall that this plot does not include future projections for the first fourteen coordinates, which include \(R_K\)). In Fig. 20 we illustrate this using only three clusters for simplicity.

Fig. 20 Parallel coordinates for three clusters. The top (bottom) panel shows the results with 5 ab\(^{-1}\) (50 ab\(^{-1}\)) for the future measurements

The effect of the new observables at the two values of luminosity is shown in Fig. 21. The left panel shows five clusters assuming 5 ab\(^{-1}\) for Belle II. The results look similar to Fig. 4, particularly with respect to the shape of the boundaries, indicating that \(R_K\) is still dominant, as can also be seen in the top panel of Fig. 20. In contrast, with a projected luminosity of 50 ab\(^{-1}\) for Belle II, \(R_K\) is no longer dominant and we illustrate the partition into five (center) and six (right) clusters in Fig. 21. The best fit marker in that figure assumes that the future measurements will fall on the current best fit.

Fig. 21 The results of five and six clusters using Euclidean distance on the pulls with covariance and average linkage in parameter space with the combined observables of Tables 1 and 3. The left (center) panel shows five clusters assuming 5 ab\(^{-1}\) (50 ab\(^{-1}\)) for the future observables, and the right panel six clusters assuming 50 ab\(^{-1}\)

4.9 Summary of results

We have used the new interactive tool pandemonium to cluster the predictions for fourteen observables obtained from models with two (and four) parameters. Salient points of this study include:

  • Clustering the observables partitions the parameter space into distinct regions with a mapping between each region and its corresponding observable predictions. Pandemonium offers a range of tools to study that mapping interactively.

  • The discriminating power of the clustering outcome is determined by the number of clusters. We propose that this number is given by the minimum number of clusters for which all points within a cluster \(C_j\) lie within \(1\sigma \) of the corresponding benchmark \(c_j\) while benchmarks are separated by at least \(1\sigma \). Clustering the pulls with Euclidean distance and average linkage for this problem partitions the parameter region \((-1.2\le C_9\le 0.14) \wedge (-0.1 \le C_{10}\le 0.45)\) into five clusters. We have shown that this can be increased to six clusters by including six new observables with the expected sensitivity at Belle II with 50 ab\(^{-1}\).

    The tool allows the user to choose a different stopping criterion. We have proposed an alternative that uses maximum distance with complete linkage, which also results in five clusters for this example.

  • Observables that dominate the clustering are those that have the largest variation in prediction relative to their uncertainty across the parameter range. In this set, \(R_K\) (ID 13) (and to a lesser extent \(R_{K^\star }\), ID 14) plays an outsized role in partitioning parameter space. The parallel coordinates (including correlations) show that this observable has the largest variance in pulls, i.e. the range in predictions is large compared to the total uncertainty, and this is what makes it a “dominant” observable. The dominance of \(R_K\) is evident as it determines the shape of the boundaries between main clusters.

  • We have illustrated several ways to showcase the effect of a dominant observable, such as combining complete linkage with maximum distance. Alternatively, it is also possible to suppress dominant observables, for example by using the Manhattan distance. These methods provide insights into the effect of removing a dominant observable and in this way allow us to study internal tensions in a global fit. For example, the global fits show that removing \(R_K\) significantly shifts the best fit value of \(C_{10}\) upwards. In the clustering, this shift is accompanied by a change in the shape of the inter cluster boundaries, indicating a very different correlation between \(C_9\) and \(C_{10}\) with and without \(R_K\).

  • Finer partitioning occurs along an approximately constant value of \(C_{10}\) (going from 4 to 5 clusters) or \(C_9\) (beyond 5 clusters). The parallel coordinates suggest that this is a collective effect from multiple observables. In this way the cluster boundaries reveal correlations between parameters that are not obviously connected to a single observable. The clustering is in effect visualizing the combination of all the equations that connect the parameters to the observables.

  • The tool offers the possibility to compare clustering with or without correlations between the observables. In this example, this shows that if correlations are ignored, the relative importance of the angular observables with respect to \(R_{K^{(\star )}}\) increases and the main clustering now depends mostly on \(C_9\). This observation is consistent with the pull and residual analysis of [4].

  • Clustering pulls with fixed errors does not capture the reference point. However, the position of the reference point in observable space is the origin and this can be visualized in parallel coordinate plots without centering.

  • Clustering can help understand tensions between observables in a global fit. For example, the best fit point shown as a star lies near the intersection of two clusters in the right panel of Fig. 1. The two bins of \(P_5^\prime \) that contribute most to the discrepancy between SM and experiment prefer the light green cluster, whereas \(R_{K}\) prefers the purple cluster. In fact, Fig. 3 indicates that the clusters where \(P_5^\prime [4-6]\) and \(P_5^\prime [6-8]\) move closer to their experimental values take \(R_{K}\) further away from its experimental value, clearly exposing the tension between these two observables.

  • Clustering with more than two parameters, in combination with slice displays, helps us understand multivariate relations, as demonstrated for the parameters \(C_9\), \(C_{10}\) and \(C_9^\prime \). Inspecting the slices provides additional information compared to the profiled results usually presented when fitting more than two parameters, by accurately capturing the shape of the boundaries as a function of the parameters in the space orthogonal to the viewing plane. A more obvious case can be seen in the example presented in the next section. In particular, the first two panels of Fig. 23 illustrate a change in the correlation between two parameters (\(C_{VL}\) and \(C_{SL}\)) as a third one (\(C_T\)) changes.

  • Without prior knowledge to preselect the most important observables it is also possible to cluster based on a much larger set of observables. This can be used as a preliminary step for selecting smaller subsets for further study. Although the large set can also be studied with the tools in pandemonium, plots showing variations of individual observables across parameter space have to be done separately. The clustering results with large numbers of observables give an overall picture and improve the resolution; but subsets must be considered to study details such as internal tensions.

  • A study with a large set of observables can also be used to preselect a smaller subset for further study. For example, looking at 89 observables we can see that \(Br(B_s\rightarrow \mu ^+\mu ^-)\) and \(P_4^\prime (B_0\rightarrow K^*\mu ^+\mu ^-)[0.1-0.98]\) could play an important role under certain conditions and understand why. The importance of the former was already known and depends on how the different existing measurements are combined. The importance of the latter is inferred from its variation across parameter space and we estimate that a reduction in the current experimental error by about a factor of 4 is needed for this observable to have full impact on the global fit.

While the main focus here is on new insights that can be obtained through the investigation of the clustering outcome, we can also perform cross-checks to see if the results are aligned with what is expected based on previous studies in the literature. Consider as an example the case where \(C_9\) and \(C_{10}\) are the only non-zero Wilson coefficients for new physics. In this scenario it is known that \(Br(B_s\rightarrow \mu ^+\mu ^-)\) depends only on \(C_{10}\), whereas \(R_{K^{(*)}}\) depends on \(C_9-C_{10}\). The addition of \(Br(B_s\rightarrow \mu ^+\mu ^-)\) thus provides sensitivity to \(C_{10}\) in the clustering study, as can be seen directly in Fig. 14. When its error is sufficiently small for its pull to compete with that of \(R_K\) in importance, it produces main partitions along this direction, as illustrated in Fig. 15.

5 Kinematic distributions in \(B\rightarrow D^{(\star )} \tau \nu \) decay and comparison with ClusterKinG

In this section we compare our work to the Python package ClusterKinG, which was recently introduced to cluster kinematic distributions [3]. The interactive environment that we have described here can also be used to cluster kinematic distributions, as we show below. Our work, as implemented in the pandemonium package, offers the following advantages:

  • Flexibility to choose linkage methods, distance functions and numbers of clusters according to a variety of validation indices.

  • Visualization of the resulting clusters in both parameter and observable space, making use of different tools from the visualization literature that also allow for direct inspection of high dimensional distributions.

  • The possibility to easily include more than one distribution, which allows model differentiation in multiple directions in parameter space.

We use two of the examples in [3] for this comparison. They both relate to new physics that may affect the processes \(B\rightarrow D^{(\star )} \tau ^- \bar{\nu }_\tau \) (listed in Table 4). The models contain three parameters, the Wilson coefficients \(C_{VL}\), \(C_{SL}\) and \(C_T\) in the low energy effective Hamiltonian responsible for the \(b\rightarrow c\tau ^-\nu _\tau \) quark level transition. We generate predictions for the same grid of Wilson coefficients as [3]: ten equally spaced points for each parameter within the bounds \( -0.5 \le C_{VL},C_{SL} \le 0.5\), \(-0.1 \le C_T \le 0.1\) using flavio. The ranges chosen for the scan are motivated by the fits in [33]. The dominant uncertainties in future experiments are assumed to be statistical and their estimate is outlined in “Appendix D.” The ClusterKinG package is designed to compare the shapes of two distributions (histograms) and its clustering is based on the distance function (footnote 16)

$$\begin{aligned} \chi ^2\left( H_1,H_2\right) =\sum _{i=1}^N \frac{\left( N_1n_{2i}-N_2n_{1i}\right) ^2}{N_2^2\sigma _{1i}^2+N_1^2\sigma _{2i}^2}. \end{aligned}$$
(14)

Each histogram corresponds to a model point for one of the kinematic distributions mentioned above and \(N_{1,2}\) are the respective normalizations.

The distance function in Eq. 14 differs from the ones we have considered so far in two important ways:

  • it compares the shapes of two distributions but ignores their overall normalization;

  • it cannot be implemented solely from coordinates because it depends on two independent quantities at each point, \(n_i\) and \(\sigma _i\).

The first point reflects a choice: an experiment may be better suited to precisely measure the shape of a distribution rather than its normalization. However, the difference between predictions of different models may be more prominent in integrated rates, and this can be captured by the options in our work, as we illustrate below. The second point can be circumvented with a user defined distance matrix, which is calculated outside the app and imported as a csv file, as we do in this section.
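A sketch of this external calculation is given below: Eq. 14 is evaluated for every pair of model histograms and the resulting matrix is written to a csv file that can then be imported as a user defined distance matrix. The histogram inputs and file names are placeholders.

```python
# Sketch: user defined distance matrix following Eq. 14, computed outside the app.
import numpy as np

def chi2_distance(n1, s1, n2, s2):
    """Eq. 14: shape comparison of two histograms with bin contents n and errors s."""
    N1, N2 = n1.sum(), n2.sum()
    return np.sum((N1 * n2 - N2 * n1) ** 2 / (N2**2 * s1**2 + N1**2 * s2**2))

hists = np.load("histograms.npy")   # (n_models, n_bins) predicted bin contents
errors = np.load("errors.npy")      # matching per-bin uncertainties

n = len(hists)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = chi2_distance(hists[i], errors[i], hists[j], errors[j])

np.savetxt("distance_matrix.csv", D, delimiter=",")
```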

We begin with a direct comparison to [3] using the nine bins in the \(\cos \theta _\tau \) distribution and matching their clustering parameters: complete linkage for six clusters and an externally calculated distance matrix corresponding to Eq. 14. To estimate the errors we assume a yield of 1000 events to assign statistical uncertainties to each bin and add a 10% systematic error in quadrature. With these choices we find benchmarks that agree within errors with those reported in [3] (footnote 17).

ClusterKinG uses an automatic stopping criterion that determines the number of clusters to be six, whereas this is a choice in our case. That stopping criterion can be reproduced within our framework in terms of the maximum cluster diameter. Requiring that all points inside a cluster be indistinguishable, the criterion stops clustering at the smallest number of clusters for which the \(\mathrm{maximum~diameter} \le N\) (N is the number of bins) for the distance function \(d(X_k,X_l)=\chi ^2(H_1,H_2)\). In Fig. 22 we show the maximum cluster diameter and radius as a function of the number of clusters for complete linkage and the distance function of Eq. 14. Applying the condition \(\mathrm{maximum~diameter} < 9\) to our data (see footnote 17) suggests four clusters instead of six, but a careful look at the figure shows that the criterion is approximately satisfied already at three clusters and does not change much between four and six clusters. On the other hand, we see a large drop in cluster radius when going from four to five clusters, suggesting that five clusters may be the preferred solution. This illustrates an advantage of determining the number of clusters after visually inspecting multiple criteria.
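The diameter check itself is straightforward to reproduce, as sketched below: complete linkage is run on the precomputed distance matrix and, for each candidate number of clusters, the largest within-cluster pairwise distance is compared to the number of bins (here nine).

```python
# Sketch: ClusterKinG-style diameter criterion on a precomputed distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.loadtxt("distance_matrix.csv", delimiter=",")      # from the previous sketch
Z = linkage(squareform(D, checks=False), method="complete")

n_bins = 9
for k in range(2, 9):
    labels = fcluster(Z, t=k, criterion="maxclust")
    diameter = max(D[np.ix_(labels == c, labels == c)].max() for c in np.unique(labels))
    print(k, round(diameter, 2), diameter <= n_bins)      # stop at the smallest passing k
```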

It is important to emphasize at this point that this stopping criterion has a very different interpretation from what we are doing, and is mentioned here solely for the purpose of comparing to our interface. ClusterKinG is comparing distributions corresponding to different parameter choices by asking how likely it is that both histograms are drawn from the same underlying normal distribution, and thus how likely it is that any difference is attributable to random fluctuations.

In previous sections we have proposed a different criterion based on the maximum cluster radius (see the discussion in Sect. 4.2) with a different interpretation. In this example there are three parameters in the models, and if we use the coordinates of [3] there is one constraint (the normalization of the distribution). We thus use a \(\chi ^2\) distribution with two degrees of freedom to evaluate the confidence interval and require a maximum cluster radius of 2.3, which suggests that five, six or seven clusters are equally valid. When a stopping criterion exhibits flat behavior as in this example, one might decide to trade complexity for interpretability and pick the simplest scenario, i.e. the smallest number of clusters in the range where the index is approximately constant.
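For reference, the 2.3 threshold quoted here corresponds, under the interpretation just described, to the 68.3% quantile of a \(\chi ^2\) distribution with two degrees of freedom:

```python
# 68.3% quantile of a chi-squared distribution with 2 degrees of freedom
from scipy.stats import chi2
print(chi2.ppf(0.683, df=2))   # ~ 2.30
```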

Fig. 22 ClusterKinG stopping criterion to determine the number of clusters (left panel) and maximum cluster radius as a function of the number of clusters for complete linkage and distance function Eq. 14 for the \(\cos \theta _\tau \) distribution in \(B\rightarrow D^{\star } \tau ^- \bar{\nu }_\tau \)

In Fig. 23 we show three slices in parameter space along with the parallel coordinates for six clusters using complete linkage and the user defined distance in Eq. 14. The slices show good agreement with Figure 1 of [3]. The parallel coordinate plot shown in the same figure for user defined coordinates (see “Appendix D”) is roughly equivalent to the benchmark distribution plots produced by ClusterKinG (e.g. Figure 2 of [3]).

Fig. 23 Six clusters with complete linkage and user defined distance function Eq. 14 for the \(\cos \theta _\tau \) distribution: slices in parameter space are shown in the \(C_{VL}-C_{SL}\) plane with \(C_T=-0.1\) (left) and \(C_T=0.1\) (center); and in the \(C_{VL}-C_{T}\) plane with \(C_{SL}=0.5\) (right). The parallel plot for user defined coordinates is shown in the bottom panel

Going beyond a direct comparison with ClusterKinG we have access to all the options discussed previously. For example we can combine more than one kinematic distribution by labeling the bins with sequential coordinates as listed in “Appendix.” To avoid unnecessary complications associated with correlated errors, we illustrate this by combining the \(\cos \theta _\tau \) distribution of \(B\rightarrow D^{\star } \tau ^- \bar{\nu }_\tau \) with the \(q^2\) distribution of the branching ratio \(B\rightarrow D \tau \nu \). In this case we are interested in the full information about the model carried by the observables, including the normalization, and we revert to using the pull coordinates.

The first thing to notice is that the resolving power of the combined distributions, as measured for example by the maximum cluster radius, increases to more than eight clusters. For simplicity then, we choose four clusters to illustrate a few possibilities. In Fig. 24 we show the parallel coordinates using average linkage and Euclidean distance on the pulls.

Fig. 24 Parallel coordinates of the combined distributions \(\cos \theta _\tau \) in \(B\rightarrow D^{\star } \tau ^- \bar{\nu }_\tau \) and \(q^2\) in \(B\rightarrow D \tau \nu \) for four clusters using average linkage and Euclidean distance on the pulls

The first nine coordinates can be directly compared to Fig. 23. Note that while the lines showing the different models were crossing in the parallel coordinate plot of the user defined coordinates (i.e. for normalized histograms), this does not happen here, because the normalization of the distributions is different for each model. The same effect enhances the importance of the first few and last few bins in Fig. 23 and diminishes that of \(O_{4-6}\), for example. Using the pulls, all nine bins of the \(\cos \theta _\tau \) distribution in \(B\rightarrow D^{\star } \tau ^- \bar{\nu }_\tau \) have comparable importance in the clustering. The \(q^2\) distribution of the branching ratio for \(B\rightarrow D \tau \nu \) behaves differently: the last two bins appear as the dominant coordinates and would have a large effect on the clustering. As before, we can choose to enhance the importance of these coordinates by using complete linkage and maximum distance, or to suppress it by using Manhattan distance.

The complementarity of the two distributions can be seen in Fig. 25 where we show slices of the clusters in parameter space in the \(C_{VL}-C_{SL}\) plane with \(C_T=-0.1\) (top) and in the \(C_{VL}-C_{T}\) plane with \(C_{SL}=0.5\) (bottom). The variation of predictions across parameter space for the same slices is also shown for the coordinates \(O_1\) and \(O_{18}\). All coordinates in the \(\cos \theta _\tau \) distribution in \(B\rightarrow D^{\star } \tau ^- \bar{\nu }_\tau \) exhibit a similar pattern to \(O_1\) that can separate \(C_{VL}\) better than \(C_{SL}\), while most coordinates in the \(q^2\) distribution of the branching ratio for \(B\rightarrow D \tau \nu \) exhibit a pattern similar to \(O_{18}\), showing a correlation between \(C_{VL}\) and \(C_{SL}\). The situation is opposite in the bottom panel, where the \(\cos \theta _\tau \) distribution (illustrated by the behavior of \(O_1\)) exhibits a correlation between \(C_{VL}\) and \(C_{T}\) and the \(q^2\) distribution provides little resolving power along \(C_T\). The dominance of the last few \(q^2\) coordinates is reflected in the shape of the cluster boundaries in the respective slices.

Fig. 25 Slices of four clusters using average linkage and Euclidean distance on the pulls in parameter space: the \(C_{VL}-C_{SL}\) plane with \(C_T=-0.1\) (top), and the \(C_{VL}-C_{T}\) plane with \(C_{SL}=0.5\) (bottom). The variation of predictions across parameter space for the same slices is also shown for the coordinates \(O_1\) and \(O_{18}\)

Tours of the clusters in observable space in this example also reveal the intrinsic dimensionality. With the original ClusterKinG setup there are three parameters but also a constraint requiring the same normalization for all models, resulting in a strip (2D surface) in nine dimensions. Using instead the pulls with Euclidean distance, without the normalization constraint, the 3D nature of the clusters in observable space is revealed. In Fig. 26 we illustrate this with two still plots (with links to animated gifs) from tours of six clusters in observable space, one for each of the two coordinate choices.

Fig. 26 Still image from the tour showing six clusters with complete linkage in 9D observable space for the \(\cos \theta _\tau \) distribution in \(B\rightarrow D^{\star } \tau ^- \bar{\nu }_\tau \), with user coordinates and distance to match the ClusterKinG setup (left, see animation here) and with Euclidean distance on the pulls (right, see animation here). The right plot shows a t-SNE view of the 9D observable space, also illustrating that this is a 1D structure

5.1 Salient points

Pandemonium is an interactive graphical interface which provides multiple tools to study clustering. It can be used in a manner that mimics ClusterKinG but offers many ways to go beyond it. A few key differences are:

  • Pandemonium allows flexible coordinate input to suit a specific problem. Without additional user input, Pandemonium relies on pulls, which are useful to compare models against each other. ClusterKinG uses coordinates appropriate for determining how likely it is that two histograms are drawn from the same underlying distribution.

  • We have illustrated that stopping criteria are very sensitive to small numerical differences. In these scenarios our approach of visualizing a range of statistics is preferable to the automatic stopping criterion used in ClusterKinG.

  • Pandemonium allows the user to include multiple histograms or other observables when comparing different models. It also allows the user to select from a menu of possible linkages and distance functions and we have shown how this flexibility can be used to obtain new insights on the data.

  • We provide a range of high dimensional visualizations of the observable space (parallel coordinate plots, dimension reduction and a tour display) in the same interface, allowing a better understanding of the clustering outcome.

6 Summary and conclusions

This paper explores the use of hierarchical clustering in combination with data visualization and an interactive user interface for applications in physics. The work has been implemented in the interactive tool pandemonium, best suited for clustering (in observable space) multiple predictions organized as an array of coordinates normalized to uncertainty. The clustering outcome can then be visualized in parameter space. We have included examples that study different problems from flavor physics, showing how clustering can be used to complement a global fit (Sect. 4) and understand the potential of future measurements (Sect. 4.7). We have also included a comparison to the approach presented in ClusterKinG (Sect. 5).

By relating the clustering outcome with a set of linked displays, we illustrate how to explore relations between parameter and observable space. This exploration benefits from the interactive environment, which enables easy comparison of the results with different settings for the clustering. The different settings can be chosen judiciously to enhance or suppress the importance of particular observables.

The applications included in this paper present insights into well studied and understood B physics problems which serve to hone the method, and we summarized our physics results in Sect. 4.9. The approach and software tools can be applied to a large number of phenomenological studies in particle physics and beyond, promising new insights in particular in scenarios that are not as well understood.

Consider for example new physics models that introduce new particles, e.g. models with an extended Higgs sector or supersymmetric models, where we may wish to understand the connection between the model parameters and the physical particle masses at the weak scale. Here we might use clustering to understand the connections between these two parameter spaces (rather than with an experimental observable space).

The observable space differs from the preferred theoretical parametrization in a range of fields. In contrast to particle physics, the transfer functions between the two spaces might not be known but may need to be estimated from the data as well; see [34] for an example from hydrology. In this case the clustering approach could help understand these connections better. More generally speaking, this approach can be useful whenever we aim to better understand a complex connection between two representations of the same points. A very general example is the understanding of statistical models with multiple latent variables that are inferred from the observed ones (potentially using complex black-box models).

We have made the software available on GitHub in the form of an R package, https://github.com/uschiLaa/pandemonium.