1 Introduction

Many problems in physics can be broadly characterized as a description of a large number of observations with models that contain multiple parameters. It is common practice to perform a global fit to the observations to arrive at the set of parameter values that best fits the data. To understand how well this fit describes the observations, a series of one or two-dimensional projections of confidence level regions are usually provided.

It is desirable to visually inspect the results of such fits to gain insight into their structure. One possibility is to directly compare the predictions of different parameter sets in the vicinity of the best fit. A simple algorithm that realises this idea with a manageable number of such parameter sets can be constructed using singular value decomposition (SVD). One first decides the confidence level at which to make the desired comparison and quantifies it with the corresponding \(\Delta \chi ^2\) for the appropriate number of parameters being fit, n. The region in parameter space within the desired confidence level is approximately an n-dimensional ellipsoid, and SVD provides an ideal set of \(2\times n\) points on which to evaluate the predictions of the model for visual inspection. These points are given by the intersections of the ellipsoid with its principal axes and clearly provide a minimal sample of parameter space that covers all relevant directions at a desired confidence level.

A tool for the direct visualisation of the high dimensional model predictions thus constructed has existed in the statistics literature for many years, but has not been applied to high energy physics problems recently. It is called a tour, and is a dynamic visualization of low-dimensional projections of high-dimensional spaces. The most recent incarnation of the tool is available as the R [3] package tourr [4]. The goal of this paper is to introduce the use of a tour as a visualisation tool for sensitivity studies of parton distribution functions (PDFs), building on the formalism that has been developed over the years by the CTEQ collaboration. It is beyond the scope of this article to provide a detailed analysis of the PDF uncertainties. The choice of this example has two motivations: the PDF fits embody the generic problem of multidimensional fits to large numbers of observables that are common in high energy physics; and Ref. [1] has recently provided the parameter sets for this problem in an initial effort to visualize the PDF fits. Our starting point will be the PDFSense [1] results, but our study differs in an important way: PDFSense utilizes the TensorFlow Embedding Projector [5], limiting visualisation to three of the first ten principal components, that is, a 3-d subspace, whereas the tour allows us to explore the full space. As we will see here, this allows additional insights into the fits.

Our paper is organised as follows. In Sect. 2 we first describe the problem as formulated in Ref. [1] and we discuss a toy example to illustrate the concepts involved. We then introduce the tour algorithm and its implementation in Sect. 3. Finally, we discuss the results obtained by applying the tour to the PDFSense dataset in Sect. 4 and present our conclusions in Sect. 5.

2 PDF fits and residuals

The analysis of collider physics results relies on theoretical calculations of cross-sections and distributions. Factorization theorems allow us to bypass non-perturbative physics that cannot be calculated from first principles and to describe, instead, the initial state of a reaction in terms of parton distribution functions, or PDFs. These consist of simple functional forms describing the probability density for finding a given quark or gluon in the proton with a given momentum fraction x, at a given momentum transfer scale Q, in the lowest order approximation. The PDFs used today have been constructed by fitting high energy physics data collected over many years by multiple experiments and are produced by large collaborations. As such, they constitute an ideal example of a multidimensional parameter fit to a large data set to study with a tour.

For our study we will make use of the framework for treating uncertainties of the PDF predictions as defined in [6, 7]. The best fit PDF, defined by the set of n parameters \(a^0_i\), is obtained by finding the global minimum of a \(\chi ^2\) function. To study uncertainties in the fit one considers small variations of the parameters around the minimum, using a quadratic approximation for the \(\chi ^2\) function written in terms of the Hessian matrix of second derivatives at the minimum, H. The eigenvectors of this matrix provide the principal axes of the confidence level ellipsoids around the global minimum, and displacements along these directions define the points \(\vec{a}_l^{\pm }\) in the n-dimensional parameter space, providing 2n PDF sets that differ from the best fit at a desired confidence level.
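For orientation, using the quadratic approximation \(\chi ^2(\vec{a})\approx \chi ^2(\vec{a}_0)+\frac{1}{2}\Delta \vec{a}^{\,T} H \Delta \vec{a}\) written out explicitly in Eq. 8 below, these sets can be expressed in terms of the eigenvectors \(\vec{v}_l\) and eigenvalues \(\epsilon _l\) of H (the notation \(\epsilon _l\) is ours, and conventions, e.g. the use of rescaled parameters, differ between fitting groups):

$$\begin{aligned} H\vec{v}_l=\epsilon _l\vec{v}_l, \qquad \vec{a}_l^{\,\pm }=\vec{a}_0\pm \sqrt{\frac{2\Delta \chi ^2}{\epsilon _l}}\,\vec{v}_l, \qquad l=1,\dots ,n, \end{aligned}$$

so that each of the 2n sets lies on the \(\Delta \chi ^2\) ellipsoid along one of its principal axes.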

Reference [1] has introduced the package PDFSense to study the sensitivity of different experiments to different aspects of the PDFs. An ingredient of that study is the set of so-called shifted residuals, which are related to the experimental contribution to the \(\chi ^2\) by [8]

$$\begin{aligned} \chi ^2_E (\vec {a}) = \sum _{i=1}^{N_d} r^2_i(\vec {a}) + \sum _{\alpha =1}^{N_{\lambda }}\bar{\lambda }_{\alpha }^2(\vec {a}) \end{aligned}$$
(1)

where the \(\bar{\lambda }_{\alpha }\) are the best-fit nuisance parameters. The shifted residuals \(r_i(\vec {a})\) are calculated as the difference between the theoretical prediction \(T_i(\vec {a})\) and the shifted central data value \(D_{i,sh}(\vec {a})\), normalised by the total uncorrelated uncertainty \(s_i\),

$$\begin{aligned} r_i(\vec {a}) = \frac{1}{s_i}(T_i(\vec {a}) - D_{i,sh}(\vec {a})). \end{aligned}$$
(2)

Note that \(D_{i,sh}(\vec {a})\) is the observed central value shifted by a function of the optimal nuisance parameters \(\bar{\lambda }_{\alpha }\) and therefore depends on the point in parameter space considered. The so-called response of a residual to an experimental result i is then defined as [1]

$$\begin{aligned} \delta _{i,l}^{\pm } \equiv (r_i(\vec {a}_l^{\pm })-r_i(\vec {a}_0))/\langle r_0\rangle _E \end{aligned}$$
(3)

with \(\langle r_0\rangle _E\) the root-mean-square residual characterizing the quality of the fit to experiment E, which follows from Eq. 1 as

$$\begin{aligned} \langle r_0\rangle _E \approx \sqrt{\frac{\chi ^2_E(\vec {a}_0)}{N_d}}. \end{aligned}$$
(4)

This residual response parameterizes the change in residuals with variations along the independent directions \(\vec {a}_l^{\pm }\).

Large values of \(\delta _{i,l}^{\pm }\) therefore indicate considerable variation in the theory prediction values within the selected window of allowed probability variation along the considered direction. We thus consider a 2N dimensional vector

$$\begin{aligned} \vec {\delta }_i=\{\delta _{i,1}^{+},\delta _{i,1}^{-},\ldots ,\delta _{i,N}^{+},\delta _{i,N}^{-}\}. \end{aligned}$$
(5)

for each data point (i.e. experimental result). Concretely, here we consider a 56 dimensional parameter space in which we want to compare and group the experimental results. These responses \(\vec {\delta }_i\) are calculated and provided by Ref. [1] and they constitute the starting point of our study.
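As a rough illustration of Eqs. (3) and (4), the following R sketch assembles the response vectors for one experiment; all object names are placeholders (r_best for the residuals at \(\vec{a}_0\), r_plus and r_minus for matrices of residuals with one column per eigenvector direction), and the nuisance-parameter term of Eq. (1) is ignored in the estimate of \(\langle r_0\rangle _E\), as in Eq. (4).

```r
# Sketch of Eq. (3): residual responses for one experiment E.
# r_best: vector of shifted residuals r_i(a_0), length N_d
# r_plus, r_minus: N_d x N matrices of residuals at the sets a_l^+ and a_l^-
response_vectors <- function(r_best, r_plus, r_minus) {
  r0 <- sqrt(mean(r_best^2))                  # <r_0>_E ~ sqrt(chi2_E / N_d), Eq. (4)
  delta_plus  <- (r_plus  - r_best) / r0      # delta_{i,l}^+
  delta_minus <- (r_minus - r_best) / r0      # delta_{i,l}^-
  # one 2N-dimensional row per data point (column order differs from Eq. (5)
  # but carries the same information)
  cbind(delta_plus, delta_minus)
}
```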

Fig. 1 For illustrative purposes, two data sets for the gluon parton distribution function, in the form \(p(x)\pm \Delta p(x)\) for 15 and 16 values of x, respectively (shown in red and blue). The left (right) panel shows the low (high) x region

Fig. 2 The \(\chi ^2\)-function (black) and its quadratic approximation (orange); their intersection with a 95% confidence level plane is shown in the right panel. The intersections of the principal axes with the ellipse (which occurs in the quadratic approximation) are shown as black dots in the right panel. The numbers label the eigenvector of H corresponding to that direction

2.1 Simple illustrative example

The procedure described so far has been used for many years, but it is complicated. For newcomers to the field, we illustrate it here using a simple example drawn from two early data sets for the gluon parton distribution function extracted from two types of \(\psi \) production experiments [9]. This example will allow us to illustrate all the concepts involved. In Fig. 1 we show these two data sets, labelling the points and their error bars \(p(x)\pm \Delta p(x)\), for 15 and 16 values of x (in red and blue) respectively. The points are fit to the two-parameter function

$$\begin{aligned} g(a,b,x)=\frac{1}{2}(1+b)(1-x)^b x^a, \end{aligned}$$
(6)

similar to but simpler than the forms used today. The next step is to minimise the \(\chi ^2\)-function defined by

$$\begin{aligned} \chi ^2(a,b)=\sum _{x_i}\left( \frac{g(a,b,x_i)-p(x_i)}{\Delta p(x_i)}\right) ^2. \end{aligned}$$
(7)

The parameters \(a_0,b_0\) that result in the global minimum \(\chi ^2(a,b)_\mathrm{min}\) define the best fit to the data. They are shown as the cross in the right panel of Fig. 2, and produce the solid black curve shown in Fig. 1. At the same time one adopts a quadratic approximation to the \(\chi ^2\) function in the vicinity of its minimum

$$\begin{aligned} \chi ^2(a,b)\approx \chi ^2(a_0,b_0) +\frac{1}{2} \left( \begin{array}{cc} a-a_0&b-b_0 \end{array} \right) \left( \begin{array}{cc} \frac{\partial ^2\chi ^2(a,b)}{\partial a^2} &{} \frac{\partial ^2\chi ^2(a,b)}{\partial a\partial b} \\ \frac{\partial ^2\chi ^2(a,b)}{\partial a\partial b} &{} \frac{\partial ^2\chi ^2(a,b)}{\partial b^2} \end{array} \right) _0 \left( \begin{array}{c} a-a_0 \\ b-b_0 \end{array} \right) , \end{aligned}$$
(8)

where the matrix of second derivatives evaluated at the global minimum is the well-known Hessian. This approximation is not needed for the simple example discussed here, but it is used in current global fits, where it offers features complementary to exact numerical methods [10]. To quantify the error in the fit one then constructs the region in (a, b) parameter space corresponding to a given confidence level. For our example we take \(\chi ^2(a,b)-\chi ^2(a_0,b_0)\le 5.99\), which corresponds to a 95% confidence level in the estimation of two parameters. The intersection of the plane \(\chi ^2(a,b)=\chi ^2(a_0,b_0)+ 5.99\) (green) with the \(\chi ^2(a,b)\) function (shown in black) and its quadratic approximation (in orange) is shown in the left panel of Fig. 2. The right panel of the same figure shows the ellipsoid (two-dimensional in this case) defined by this intersection for the quadratic approximation (in orange), and the deformed ellipsoid, in black, for the exact \(\chi ^2(a,b)\) function. The difference between the two is small, indicating that the quadratic approximation is quite adequate at this confidence level. The eigenvectors of the Hessian matrix provide the directions of the principal axes of the ellipsoid and are shown in black in the right panel of Fig. 2: the dashed (dotted) lines correspond to the direction associated with the largest (smallest) eigenvalue. The intersections of these axes with the ellipse, shown as black dots, provide a set of fits to the data that can be compared with the best fit and used as a means of quantifying the uncertainty in the fitting procedure. These fits are also shown in Fig. 1.
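A compact R sketch of this procedure is given below; the vectors x, p and dp, and the starting values, are placeholders for the data shown in Fig. 1, and optim is used here for the minimisation, which is not necessarily the method used in Ref. [9].

```r
# Toy fit of Eqs. (6)-(8): best fit, Hessian, and 95% CL points on the ellipse.
g    <- function(a, b, x) 0.5 * (1 + b) * (1 - x)^b * x^a          # Eq. (6)
chi2 <- function(par) sum(((g(par[1], par[2], x) - p) / dp)^2)     # Eq. (7)

fit <- optim(c(0.5, 5), chi2, hessian = TRUE)  # (a0, b0) and numerical Hessian
eig <- eigen(fit$hessian)                      # principal axes of the ellipse

dchi2 <- qchisq(0.95, df = 2)                  # = 5.99 for two parameters
tl    <- sqrt(2 * dchi2 / eig$values)          # from dchi2 = (1/2) eps_l t^2
axis_points <- rbind(fit$par + tl[1] * eig$vectors[, 1],
                     fit$par - tl[1] * eig$vectors[, 1],
                     fit$par + tl[2] * eig$vectors[, 2],
                     fit$par - tl[2] * eig$vectors[, 2])  # the 2n = 4 boundary fits
```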

The set of responses, \(\delta _{i,l}^{\pm }\), in this example is shown in Fig. 3. From inspecting the limiting behaviour of Eq. 6 it is clear that the description at low x depends mainly on a, while large values of x are mostly sensitive to b. This is reflected in the uncertainty curves in Fig. 1, and also when looking at the \(\delta \)s. For this simple example the main directions identified by the Hessian method are in fact well aligned with the original directions in parameter space. Considering the values of \(\delta \), we find that \(\delta _1^{\pm }\), which corresponds mainly to a variation of a, takes large values for bins with low values of x, while \(\delta _2^{\pm }\) takes large values for bins with large values of x. We conclude that the parameter dependence is captured by the \(\delta \)s, as expected. For the more complex parametrizations and fits considered in the following, this correspondence is no longer evident from the functional form, and the \(\delta \) values may instead be used to infer the parameter dependence of a given prediction.

Fig. 3 The \(\delta \) parameter space of the simple illustrative example: \(\delta _i^+\) form the axes and color indicates the respective value of x. Note that only \(\delta _i^+\) is shown because for this problem the \(\delta _i^-\) directions contain the same information. Labelled points are the same as those labelled in Fig. 1, and illustrate key features of the fits

In Figs. 1 and 3 we have labelled the following four points:

  1. point with the highest value in \(\delta _1\), found at low x and with a small error bar

  2. point with the highest value in \(\delta _2\), which also has the highest value of x

  3. point that is not well described by the fits, but has small values of \(\delta \)

  4. point with an intermediate value of x and small errors, resulting in larger values in both \(\delta \) directions.

These observations illustrate that large values of \(\delta \) correlate with points with errors that are comparable to or smaller than the uncertainty in the fit as parametrized by the Hessian method. At the same time, points that are not well described by the fits do not necessarily result in large \(\delta \)s.

3 Data visualisation

When looking for structure in high dimensional parameter spaces we rely on tools for dimension reduction and visualisation. Due to the importance of this task, many methods have been developed. Here we give a brief overview of the tools used in the following work. Note that in the following we adopt the broader definition of the word “data” generally used in statistics, which is not restricted to experimental results.

3.1 Dimension reduction

3.1.1 Principal component analysis

Principal component analysis (PCA) is an orthogonal linear transformation of elliptical data into a coordinate system, such that the first basis vector aligns with the direction of maximum variance. The second basis vector is the direction of maximum variation orthogonal to the first coordinate, and the remaining basis vectors are sequentially computed analogously. It is typically used for dimension reduction. To choose the number of principal components (PCs) to use, the proportion of variance explained by each component is examined,

$$\begin{aligned} v_i^{prop} = v_i \Big / \sum _j v_j, \end{aligned}$$
(9)

with \(v_i\) the variance in the direction of PC i. Either a pre-determined proportion of total variance is required, or the proportions are plotted against the number of PCs and the number is chosen where this curve flattens towards zero. PCA is an optimization problem with a well defined solution. However, the outcome of the PCA is affected by the preparation of the input data. The preparation can also be used to highlight specific aspects of the data distribution. For example, the input data is generally centered before performing PCA by setting each variable to have a mean value of zero. In this way, large variations that only describe mean values different from zero are removed from the results. Another approach is to normalize the distribution, to emphasize directional information. Typically this means “sphering” the data points, by normalizing each vector to have length one. This allows the comparison of similar, or different, directions in the parameter space, but information about differences in length is lost by this approach.

In this work we use the standard implementation prcomp in R for the computation of the principal components.
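For instance, the two data preparations used later (PCA1: centered; PCA2: centered and sphered) can be obtained along the following lines, where deltas is a placeholder for the matrix with one residual-response vector \(\vec{\delta }_i\) per row:

```r
# PCA1: centering only; PCA2: each observation scaled to unit length first.
pca1 <- prcomp(deltas, center = TRUE, scale. = FALSE)

deltas_unit <- deltas / sqrt(rowSums(deltas^2))   # "sphering": each row to length 1
pca2 <- prcomp(deltas_unit, center = TRUE, scale. = FALSE)

prop_var <- pca1$sdev^2 / sum(pca1$sdev^2)        # proportion of variance, Eq. (9)
```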

3.1.2 Nonlinear embeddings

It is also common to examine non-linear mappings of the data points onto a low dimensional embedding. The aim is to preserve multidimensional structure by minimizing the difference between distances in the full parameter space and distances in the low dimensional projection. PCA is a simple member of this more general type of transformation. A widely used method in machine learning is the algorithm called t-distributed stochastic neighbor embedding (t-SNE) [11]. Its goal is to cluster similar points together (i.e. points with small Euclidean distance) while separating the individual clusters from one another. This gives appealing and often useful pictures, but the results should be considered with care, as t-SNE is a nonlinear transformation and does not preserve the original distances. Note that while nonlinear embeddings may be useful in identifying clusters in the data, their interpretation is limited by the lack of an analytical description of the transformation. This is not the case for linear transformations such as PCA, where the transformation can be readily reversed to identify the contribution of the original parameters to a given principal component direction.
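A t-SNE embedding can be produced in R for example with the Rtsne package; the following is only a hedged sketch for comparison (PDFSense/TFEP use their own implementation, and deltas is the same placeholder matrix as above):

```r
# Sketch of a 2-d t-SNE embedding of the (placeholder) matrix `deltas`.
library(Rtsne)
emb <- Rtsne(deltas, dims = 2, perplexity = 30)  # nonlinear embedding
plot(emb$Y, pch = 19, cex = 0.4)                 # clusters appear, distances not preserved
```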

3.2 Tour algorithm

3.2.1 Overview

When a data set has more than two parameters, the tour [12] can be used to plot the multiple dimensions. Currently the typical approach is to plot pairs of parameters, or pairs of combinations of the parameters. The tour extends this idea to plot all possible combinations. The viewer is provided with a continuous movie of smooth transitions from one combination to another, from which it is possible to extrapolate the shape of the parameter space in high dimensions. Seeing many combinations in quick succession shows the associations between all the parameters.

There are several types of tours. Here we use a grand tour of 2-d projections of the n-dimensional parameter space. A projection of the data is computed by multiplying an \( m \times n\) data matrix, \(\mathrm{\mathbf{X}}\), having m sample points in n dimensions, by an orthonormal \(n \times d\) projection matrix, \(\mathrm{\mathbf{A}}\), yielding a d-dimensional projection. The grand tour is a mechanism for choosing which projections to display, and how the smooth transitions happen. New projections are chosen from all possible projections, and a geodesic interpolation to the target projection provides the smooth transition. The original algorithm is documented in [13]. The implementation used in this paper is from the tourr [4] package in R [3].
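In R, the projection step and the grand tour itself reduce to a few lines (a sketch; X is a placeholder data matrix and A a placeholder orthonormal basis):

```r
# A single d-dimensional projection of the m x n data matrix X.
proj <- X %*% A                              # m x d matrix of projected sample points

# The grand tour animates a random sequence of such projections (tourr package).
library(tourr)
animate_xy(X, tour_path = grand_tour(d = 2))
```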

The tour shows linear projections of the parameter space. In contrast, methods like t-SNE [11] produce non-linear mappings from high- to low-dimensional space. The difference is that the shape of the data in high dimensions is preserved by linear projections, but not by nonlinear mappings.

3.2.2 Algorithm

A movie of data projections is created by interpolating along a geodesic path from the current (starting) plane to the new target plane. In the grand tour, the target plane is chosen at random. The interpolation algorithm (as described in [14]) follows these steps; a minimal code sketch of the interpolation is given after the list:

Table 1 Summary of key findings, comparing observations made by visualising the PDFSense results with TFEP with the additional insights that can be gained using the tour. A complete list of experimental datasets together with their CTEQ labelling IDs is given in Appendix A
  1. Given a starting \(n\times d\) projection \(\mathrm{\mathbf{A}}_a\), describing the starting plane, create a new target projection \(\mathrm{\mathbf{A}}_z\), describing the target plane. It is important to check that \(\mathrm{\mathbf{A}}_a\) and \(\mathrm{\mathbf{A}}_z\) describe different planes, and generate a new \(\mathrm{\mathbf{A}}_z\) if necessary. To find the optimal rotation of the starting plane into the target plane we need to find the frames in each plane which are the closest.

  2. Determine the shortest path between frames using singular value decomposition, \(\mathrm{\mathbf{A}}_a'\mathrm{\mathbf{A}}_z=\mathrm{\mathbf{V}}_a\Lambda \mathrm{\mathbf{V}}_z', ~~~\Lambda =\text{ diag }(\lambda _1\ge \dots \ge \lambda _d)\); the principal directions in each plane are \(\mathrm{\mathbf{B}}_a=\mathrm{\mathbf{A}}_a\mathrm{\mathbf{V}}_a, \mathrm{\mathbf{B}}_z=\mathrm{\mathbf{A}}_z\mathrm{\mathbf{V}}_z\), a within-plane rotation of the descriptive bases \(\mathrm{\mathbf{A}}_a, \mathrm{\mathbf{A}}_z\) respectively. The principal directions are the frames describing the starting and target planes which have the shortest distance between them. The rotation is defined with respect to these principal directions. The singular values, \(\lambda _i, i=1,\dots , d\), define the smallest angles between the principal directions.

  3. Orthonormalize \(\mathrm{\mathbf{B}}_z\) on \(\mathrm{\mathbf{B}}_a\), giving \(\mathrm{\mathbf{B}}_*\), to create a rotation framework.

  4. Calculate the principal angles, \(\tau _i = \cos ^{-1}\lambda _i, i=1,\dots , d\).

  5. Rotate the frames by dividing the angles into increments, \(\tau _i(t)\), for \(t\in (0,1]\), and create the ith column of the new frame, \(\mathrm{\mathbf{b}}_i\), from the ith columns of \(\mathrm{\mathbf{B}}_a\) and \(\mathrm{\mathbf{B}}_*\), by \(\mathrm{\mathbf{b}}_i(t) = \cos (\tau _i(t))\mathrm{\mathbf{b}}_{ai} + \sin (\tau _i(t))\mathrm{\mathbf{b}}_{*i}\). When \(t=1\), the frame will be \(\mathrm{\mathbf{B}}_z\).

  6. Project the data into \(\mathrm{\mathbf{A}}(t)=\mathrm{\mathbf{B}}(t)\mathrm{\mathbf{V}}_a'\).

  7. Continue the rotation until \(t=1\). Set the current projection to be \(\mathrm{\mathbf{A}}_a\) and go back to step 1.
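The following base-R sketch condenses steps 2–6 for a single interpolation fraction, assuming column-orthonormal starting and target bases and ignoring degenerate cases (identical planes, \(\lambda _i\approx 1\)). It is an illustration of the algorithm above, not the tourr implementation itself.

```r
# Frame at a given fraction along the geodesic from plane A_a to plane A_z
# (both n x d, column-orthonormal). Simplified: no degenerate-case handling.
frame_at <- function(A_a, A_z, frac) {
  sv  <- svd(t(A_a) %*% A_z)                  # step 2: A_a' A_z = V_a Lambda V_z'
  B_a <- A_a %*% sv$u                         # principal directions, starting plane
  B_z <- A_z %*% sv$v                         # principal directions, target plane
  B_s <- B_z - B_a %*% (t(B_a) %*% B_z)       # step 3: orthogonalize B_z on B_a ...
  B_s <- sweep(B_s, 2, sqrt(colSums(B_s^2)), "/")  # ... and normalize columns -> B_*
  tau <- acos(pmin(pmax(sv$d, -1), 1))        # step 4: principal angles
  B_t <- sapply(seq_along(tau), function(i)   # step 5: rotate each direction
    cos(frac * tau[i]) * B_a[, i] + sin(frac * tau[i]) * B_s[, i])
  B_t %*% t(sv$u)                             # step 6: back to the A(t) basis
}
```

The data projection at a given fraction is then simply X %*% frame_at(A_a, A_z, frac).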

In a grand tour the target plane is drawn randomly from all possible target planes, which means that any plane is equally likely to be shown. That is, we are sampling from a uniform distribution on a sphere. To achieve this, sample n values from a standard univariate normal distribution, resulting in a sample from a standard multivariate normal. Standardizing this vector to have length one gives a random value on an \((n-1)\)-dimensional sphere, that is, a randomly generated projection vector. Doing this twice, and orthonormalizing the second vector on the first, gives a random 2-dimensional projection.
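In code this amounts to the following minimal sketch (using QR orthonormalization in place of the explicit two-step construction described above):

```r
# Draw a random n x d orthonormal basis, i.e. a uniformly chosen target plane.
random_basis <- function(n, d = 2) {
  qr.Q(qr(matrix(rnorm(n * d), nrow = n)))  # normal sample, then orthonormalize
}
```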

The data typically needs some standardization or scaling before computing the tour. This is because we are considering linear combinations of the different parameter directions, and differences in overall range might otherwise dominate the resulting display. This can be as simple as centering each variable on 0, and standardizing to a range of −1 to 1. It could be as severe as sphering the data, which in statistics means that the data is transformed into principal components (from elliptical shape to spherical shape). The same term is used for a different type of transformation in other fields, where observations are scaled to fall on a high-dimensional sphere, by scaling each observation to have length 1. (An interesting diversion: this type of sphering is the same transformation made on multivariate normal vectors to obtain a point on a sphere, to choose the target planes in the grand tour.)

The initial description of the tour promised a display of all possible projections. Theoretically this is true, but in practice it would require that the user stay watching forever! However, the coverage of the space is fairly fast, depending on n, and within a short time it is possible to guarantee that all possible projections are displayed within a given angular tolerance.

3.2.3 Display

For physics problems, setting \(d=2\) is most common. The projected data is displayed as a scatterplot of points. It is also possible to overlay confidence regions, or contours. Groups in the data can be highlighted by color. Displaying the combination of variables in a particular projection can be useful to interpret patterns. This is realized by plotting a circle with segments indicating the magnitude and direction of each variable's contribution, and is called the axes.

The same tour path can be used to display subsets of the data, in different plots, to compare groups. When we break the display into subsets, the full data is also shown in each plot, in light grey. This makes it easier to do group comparison.
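In tourr such a display can be produced along the following lines; this is a hedged sketch, with X and group as placeholder objects, and the col and axes arguments as we understand them in current versions of the package:

```r
# Sketch: colour the projected points by group and show the variable axes.
library(tourr)
animate_xy(X, tour_path = grand_tour(d = 2),
           col = group,           # one colour per group in the data
           axes = "bottomleft")   # circle with variable-contribution segments
```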

Fig. 4 Projections obtained with TFEP, where principal components 3, 5 and 8 have been selected, and the view was rotated such that the jet+\(t\bar{t}\) cluster is roughly orthogonal to the DIS cluster. The top left plot shows grouping into jets+\(t\bar{t}\) (red), DIS (blue) and VBP (orange), the remaining plots highlight subgroups (indicated by CTEQ labelling IDs shown in the appendix) of the jets+\(t\bar{t}\) cluster in the same view

4 Results

This section compares the findings made using the tour with those made with PDFSense, using the recent CT14HERA2 fits [15]. The PDFSense results form the basis on which to expand the knowledge of PDF fits. The results from both tools are summarized in Table 1, where the PDFSense results were obtained using the TensorFlow Embedding Projector (TFEP) software [5] for the visualisation of high-dimensional data. The summary statistic “reciprocated distance” referenced in Table 1 is defined as:

$$\begin{aligned} \mathcal {D}_{i}\ \equiv \ \left( \sum _{j\ne i}^{N_{ all }}\frac{1}{|\vec {\delta }_{j}-\vec {\delta }_{i}|}\right) ^{-1}. \end{aligned}$$
(10)

This pair-wise distance measure takes larger values for experimental results with residual responses different from those of most other results considered, and small values if the responses are similar to most other results. For the example shown in Fig. 3 the largest value of the reciprocated distance is found for point 2, followed by point 4 and point 1. Point 3, on the other hand, has a reciprocated distance that is about a factor 10 below the maximum one, since it is found to have \(\delta \) values close to those of the majority of other data points. \(\mathcal {D}_{i}\) can therefore be used to quantify similarity, enabling for example the identification of systematically different experimental results. TFEP provides two methods, PCA and t-SNE, and Ref. [1] explores both for the visualisation of the data set. The PCA implementation returns projections onto the first 10 PCs evaluated from centered and sphered data, and allows the user to choose two or three of them to view the results.
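Eq. (10) can be computed directly from the matrix of response vectors; a sketch in R (deltas again a placeholder matrix with one row per data point):

```r
# Reciprocated distance (Eq. 10) for every row of `deltas`.
reciprocated_distance <- function(deltas) {
  D <- as.matrix(dist(deltas))   # pairwise Euclidean distances |delta_j - delta_i|
  diag(D) <- Inf                 # drop the j = i term (1/Inf = 0)
  1 / rowSums(1 / D)             # ( sum_j 1/|delta_j - delta_i| )^(-1)
}
```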

4.1 Results from PDFSense & TFEP

For comparison we first reproduce results similar to those found in [1] by using the TFEP software. A selection of four views is shown in Fig. 4; for the complete set of plots related to the PDFSense column in Table 1 we refer the reader to [1]. The selected examples show how the view was chosen based on the orthogonality of the assigned groups, and how, for the example of the jet+\(t\bar{t}\) group, the various contributions have been compared.

We can identify several limitations in using the TFEP software for the visualisation:

  • Relevant information about the distributions is encoded in more than 3 dimensions. This is clear since PCs 3, 5 and 8 have been selected in the visualisation, so the majority of the variation in the data is not captured in Fig. 4. Moreover, the application of t-SNE clustering shown in [1] results in a large number of clusters, indicating higher dimensional structure. It would be preferable to display this structure with linear projections, for which interpretations are straightforward.

  • The sphering of data points when preparing the PCA visualisation removes relevant information about the length of the vectors \(\vec {\delta }_i\).

  • In addition, while the online tool allows highlighting of groups, it is considerably less flexible in selecting options than scripted tools like the tour, limiting the level of detail at which the results can efficiently be studied.

We next explore how these points can be addressed, in particular in the framework of dynamical projections and the tour algorithm.

4.2 Expanded findings made using the tour

We first optimize the number of principal components considered in our study, and then show how the tour results expand on previous observations, as summarized in Table 1. The mapping from the original \(\delta \) coordinates onto the PCs for all PCAs considered in this work is listed in Appendix B.

4.2.1 PCA, normalisation and variance explained

In the following we study two sets of principal components (PCA1, PCA2), corresponding to the two data preparation choices described above (i.e. PCA1 = centered, and PCA2 = centered and sphered). Results from each are compared. Note that for this problem, the centering has negligible impact on the results as the mean value in each direction \(\delta _{i,l}^{\pm }\) is close to zero.

An important consideration is the number of PCs that contain relevant information. To study this we show in Fig. 5 the proportional variance (see Eq. 9) explained by the principal components, for the two choices of the PCA, with labels “Centered” for the PCA performed on centered data (PCA1) and “Sphered” for the PCA obtained from centered and sphered data (PCA2), the latter reproducing the preparation used for Fig. 4. We find a steep curve for the first few PCs, followed by a slow decay of the proportional variance, and the curve only flattens out towards zero around PC30. As a consequence we expect that looking at a 3 dimensional subset of the first 10 PCs is not sufficient to understand the variation in the considered parameter space, and that judging similarity based on the view in Fig. 4 alone is misleading.

Fig. 5 Proportional variance explained by the principal components of the 56 dimensional parameter space. To capture all the variation, one would need close to 30 principal components, but around 6 captures about 50% of the variation. Both data preparations produce similar variance explanation, but the differences are enough to matter in some interpretations

In the following we want to study a higher dimensional subspace where we base the number of dimensions considered on the results found in Fig. 5.

For simplicity, we illustrate the tour approach using just the first 6 PCs, which capture about 50% of the overall variation (Table 2). This is sufficient to provide new insights compared to Fig. 4 (left), and dedicated PCAs of subgroups can be added for more detailed studies, as we do below.
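The subspace passed to the tour is then simply the leading block of principal component scores (a sketch, continuing with the pca1 object defined earlier):

```r
# Cumulative variance (cf. Table 2) and the 6-PC scores used as tour input.
cum_var <- cumsum(pca1$sdev^2) / sum(pca1$sdev^2)
pcs6    <- pca1$x[, 1:6]
```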

Table 2 Cumulative variance (in %) explained by the first 15 PCs

4.2.2 Grand tour result details

A short tour path is generated, consisting of 20 basis planes and the interpolation between them, giving 2-dimensional projections of the 6-d space. The same path is used to compare multiple groups. The examples considered are guided by findings in [1] and are summarized in Table 1.
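Such a fixed path can be generated once and replayed for each group; a sketch with tourr follows (pcs6 as above, group a placeholder factor; exact argument names may differ between package versions):

```r
# Generate a short grand-tour path (20 basis planes) on the 6 PC scores,
# then replay the same interpolated path in the display.
library(tourr)
set.seed(2019)                                        # reproducible target planes
path <- save_history(pcs6, grand_tour(d = 2), max_bases = 20)
animate_xy(pcs6, planned_tour(path), col = group)
```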

Grouping of data points We first consider a display corresponding to Fig. 4 (left), i.e. the data set is grouped into three main clusters. Selected views from the animation are shown in Fig. 6, for PCA1 (left) and PCA2 (right). The same colors as used in Ref. [1] indicate the grouping: the DIS cluster is shown in blue, VBP in orange and the jets cluster in red. The first window in the display shows the axes, the other windows show the projected data, where one group is highlighted in color, while the remaining points are shown below in grey for easy comparison. As can be seen from the selected views, in any particular static view it is only possible to separate two of the groups at a time. The static views are not sufficient to convey the full picture obtained by watching the tour animation, which allows all three groups to be separated. The tour indicates that there is higher dimensional structure in the data points, as can be seen in the linked animation.

In addition, it is possible to visually identify substructure within the clusters (e.g. groups of points aligned along some direction) as well as outlying points. This is especially true for PCA1 which is found to provide a much clearer picture than PCA2. We also find that the DIS and VBP clusters extend in multiple directions, while the jets cluster seems to be well described in a single plane.

Fig. 6 Selected views from the grand tour results of the full dataset. The data points are grouped into DIS, VBP and jets cluster, shown in blue, orange and red respectively. Top left plot shows the projection of the PCs, and other plots show the three subgroups. Colour indicates group, and grey shows the entire data set, as a reference in order to make comparisons between groups. PCA1 (see animation here) is shown on the top row, PCA2 (see animation here) on the bottom row, the left views show a separation between DIS and jets clusters, the right views show the multidimensionality in the DIS cluster

Fig. 7 Focusing on the jets cluster, showing only the first 4 PCs. The top left plot shows the projection coordinates; groups (Tevatron, ATLAS7old, \(\ldots \)) are highlighted in black in each plot, and grey shows all the data, enabling direct comparison between subgroups. This view from the grand tour was selected because it clearly separates the outlying point in the ATLAS7new dataset. In addition the view also illustrates how the CMS results extend the reach away from the main cluster (see animation here)

The jet cluster In more detail, we investigate the jet cluster. These results are of special interest since they contain the largest data sets added in the fit, which were indeed found to be important according to [1]. In addition, the new experimental data from LHC jet measurements is of interest because of possible tensions, such as the systematic offsets in opposing directions for different rapidity bins observed in the ATLAS measurements; see [16] for a general discussion of the issue. As pointed out in [16], tensions can be reduced by adapting the treatment of systematic uncertainties, but cannot be fully resolved [17]. As seen above, the jet cluster appears to be described in a lower dimensional subspace. Indeed, performing PCA on the results in the jet cluster alone, we see that the cumulative proportional variance reaches 49/75/91/95% for PC1/2/3/4 respectively, with the proportional variance dropping to less than 2% for PC5. We therefore study substructure in this 4 dimensional space. While [1] distinguishes three types of groups, i.e. “old” jet results (those included in the CT14HERA2 fit), “new” jet results (more recent ATLAS and CMS results) and \(t\bar{t}\), it makes sense to differentiate the LHC results further by experiment and \(\sqrt{s}\) (motivated also by the differences in sensitivities observed in [1]). For simplicity we consider only the results from performing PCA on the centered data, shown in Fig. 7, with grouping into: Tevatron (IDs 504, 514), ATLAS7old (535), CMS7old (538), CMS7new (542), ATLAS7new (544), \(t\bar{t}\)-energy (565, 567), \(t\bar{t}\)-rap (566, 568) and CMS8 (545). Indeed we observe that the Tevatron results as well as the ATLAS results generally fall in the center of the cluster, with the exception of some outlying points. On the other hand, the CMS 7 and 8 TeV results extend in (different) new directions. It is interesting to note that the “old” CMS 7 TeV results extend further out than the corresponding “new” ones. In fact, while the new measurement extends to higher rapidities and lower values in jet \(p_T\), the old measurement contains higher \(p_T\) bins no longer present in the updated result, which turn out to give large values of \(\vec {\delta }\). Finally, for the \(t\bar{t}\) results we distinguish the observations binned in energy (\(p_T^t\) or \(m_{t\bar{t}}\)) or rapidity (\(y_{\langle t/\bar{t}\rangle }\) or \(y_{t\bar{t}}\)). We can identify differences between the two groups in the visualisation; however, as already noted in [1], the data points are not significantly different from the main jet cluster.

It is interesting to study which data points are found to be outlying in the visualisation. These points are highlighted in Fig. 7 and are best distinguished when watching the tour animation:

  • \(|y| > 2.5\) and \(\mu > 950\) GeV – marked with a star symbol: only one such point is found in the 7 TeV data sets. It occurs in ATLAS7new, in the last rapidity bin, and is clearly outlying (large negative values in PCs 1, 2 and 3). However, no particular trend is observed when comparing with points in nearby bins. There are two more such data points in the CMS8 data set, but they do not stand out in \(\delta \) space.

  • \(|y| > 2\) and \(\mu > 1000\) GeV – marked with a downward pointing triangle: these points are seen to align in a new direction, away from the main cluster, highlighting their importance in the fits. They are also useful for comparing the different CMS results: in this case there are points common to both datasets that nevertheless look different, suggesting the need for further study of these points.

  • for CMS8 we also highlight \(|y| < 1\) and \(\mu < 200\) – marked with a diamond symbol: these points are very different from the main distribution and give large positive values in PC1. It is interesting that we can clearly separate these low \(\mu \) bins in the CMS8 set but not in CMS7.

The DIS cluster We next consider subgroups of the DIS cluster for which the TFEP visualisation allowed only limited interpretation. Concretely, while the bulk of the cluster was clearly spanned by the HERA results (ID 160) as expected, other results were found to follow quite different distributions. In particular the Charm SIDIS (ID 147) results are distributed in a different direction, overlapping partly with both the DIS and the jet clusters, while the dimuon SIDIS results (IDs 124–127) were found in the center of the distribution and it was concluded that this cluster extends in an orthogonal direction, although it was not shown explicitly.

We therefore compare in detail these three groups. In this case it is useful to consider both PCA1 and PCA2, the latter being more closely related to the TFEP output. First, we observe that the dimuon SIDIS is poorly separated in the PCA2 projection, whereas PCA1 clearly shows how it extends considerably away from the main DIS cluster (ID 160). On the other hand, the charm SIDIS can be separated more easily when studying the directional information in the PCA2 projection, because the individual values in the space of deltas are all comparatively small. These results suggest that either predictions for these types of observables are well under control in the existing fits, or alternatively that the experimental errors are too large for them to be constraining. We also observe substructure in the DIS HERA1+2 set, see Fig. 8 and the corresponding animation, indicating that this group combines a number of qualitatively different types of results.

Comparison with summary statistics We now consider the experimental results with the highest values in reciprocated distances to show they can also be easily distinguished with our visualisation. We highlight three groups in Fig. 9: the HERA dataset (ID 160), the W asymmetry measurements (ID 234, 266 and 281) and the fixed-target Drell-Yan measurements from E605 and E866 (ID 201, 203 and 204).

Fig. 8 As Fig. 6, but showing only selected results in the DIS cluster, i.e. DIS HERA1+2 (black), Charm SIDIS (red) and dimuon SIDIS (green). The left view, for PCA1 (see animation here), shows clear separation of the dimuon SIDIS results; the right view, for PCA2 (see animation here), shows apparent separation of the charm SIDIS results obtained by focussing on directional information

Fig. 9 Left: Comparison of groups with large reciprocated distance measures, where now the full dataset is shown below in gray. Right: Comparison in subspace found by performing PCA on DY data only, where DY data is shown in red and all other data points are shown below in gray. Again selected views from the grand tour results are shown here. The left view (see animation here) roughly shows how the HERA and WASY data points are far away from the main distribution of data points, while the DY points are found only in the center. The right view (see animation here) illustrates the three different types of distributions found in the DY group

Indeed we find that the W asymmetry measurements (234, 266 and 281) follow a very distinct distribution, as does the HERA DIS dataset (160). On the other hand, the fixed-target Drell-Yan measurements (201, 203 and 204) do not stand out in our visualisation. We find that this is a consequence of the dimension reduction, and we can easily identify views separating this group from the other data points when considering additional dimensions. Here we show this by looking at projections found by performing PCA on this data subset only, and using it to compare this group to the other data sets in the subspace of the first 4 PCs thus defined. Note however that the tour allows visualisation of the distributions in the full parameter space, which would yield the same information. Our choice of procedure is simply to limit the viewing times required, which grow with the number of dimensions considered.

This type of visualisation, together with inverting the mapping onto principal components, may be used to identify the origin (i.e. the underlying physics) of the large differences. For example, the first three PCs found for the DY dataset capture three different distributions, and mapping those back to the original \(\delta \) directions, together with a study of those directions with respect to the uncertainty in individual parton PDFs, may provide additional insight. Such detailed investigations are however beyond the scope of this study.

5 Summary and conclusions

Starting from the set of 56 dimensional vectors in the space of residual responses calculated in [1], we have demonstrated how the grand tour may be used for visualizations in particle physics. The 56 dimensions are reduced to 6 dimensions (for illustration) using principal component analysis, and the resulting representation is then passed on to the tour. The findings made about the fits using the tour, even with only 6 dimensions, are more comprehensive and clearer than what TFEP allows.

The tour visualisation verified several results from [1], notably the separation of the DIS, VBP and JET experiments into clusters populating different regions of delta space. It also allowed us to go into further detail by examining certain substructures within these groups. We have moreover demonstrated that the tour can complement and support analyses based on the use of reciprocated distances.

In our examples we have considered performing the PCA either on centered data (PCA1) or on centered and sphered data (PCA2), as they highlight different aspects of the structure, the former retaining length information and the latter emphasizing directionality. In general we find the results from PCA1 more useful, in particular for this application where the length of the individual data point vectors (i.e. for each experiment) carries important information that is lost when sphering the input data.

The sensitivity defined in [1], or the projection of the \(\delta \)s onto a direction given by the gradient of a QCD variable (e.g. a cross section prediction), can also be inspected visually, and the tour permits this visualisation in multiple dimensions.

Table 3 Experimental datasets considered as part of CT14HERA2 and included in the analysis. IDs are following the standard CTEQ labelling system with 1XX/2XX/5XX representing datasets in the DIS/VBP/JET group

We conclude that the method described above is a valuable tool for PDF uncertainty and sensitivity studies. In addition, the visual analysis allows a better understanding of the method itself and can uncover unexpected features, and possibly even errors. It can provide experiments with a guide to the measurements needed to improve PDF fits.