
1 Introduction

Clear and informative visualization of similarities between populations is a key component both in the development of methodology and in scientific publications. Depending on the particular use case, a wide range of techniques are available. One such visualization technique is the Taylor diagram (TD) [10], which was devised to relate several statistical quantities and allow for comparison of numerous data points in a single diagram. The TD has been used frequently in numerous applications, particularly in the climate sciences [6, 8]. However, the statistical quantities displayed in the TD have some weaknesses that limit the usability of the diagram. For instance, one quantity in the diagram is the Pearson correlation coefficient, which only models linear relationships and can be sensitive to outliers. This limits the TD, as many real-world applications involve data that contain outliers and that are connected through non-linear relationships.

Fig. 1. KTD: The radial distance from the origin to each point is proportional to the length of the kernel mean embedding. The distance between the points is the maximum mean discrepancy.

One of the most well-known and widely used approaches for measuring similarity in machine learning is through kernel methods [3, 4]. At its core, a kernel function corresponds to a dot product in a high-dimensional feature space, where non-linear relationships between data in the input space can become linear relationships in the new feature space. As long as the kernel is positive definite, the mapping to the feature space does not have to be computed explicitly.
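As a concrete illustration of this point (our addition, not part of the original text), consider the homogeneous polynomial kernel \(\kappa (\textbf{x}, \textbf{z}) = (\textbf{x}^\top \textbf{z})^2\): it equals the dot product of explicit degree-two monomial feature maps, so the feature space never needs to be constructed when only dot products are required. A minimal NumPy sketch with arbitrary toy vectors:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 monomial feature map for a 2-D input: (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Kernel evaluated in the input space ...
k_xz = (x @ z) ** 2
# ... equals the dot product in the explicit feature space.
print(np.isclose(k_xz, phi(x) @ phi(z)))  # True
```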

In this paper we propose the kernelized Taylor diagram (KTD), which is illustrated in Fig. 1. This diagram relates well-known quantities from the kernel literature [9], namely the maximum mean discrepancy (MMD) and the kernel mean embedding, in a single figure. To the best of our knowledge, such a diagram has never been devised prior to this work. The KTD makes no assumptions about the distributions of the populations and can model a rich family of relationships between populations. The functionality of the proposed diagram is demonstrated on synthetic data. Code: https://github.com/Wickstrom/KernelizedTaylorDiagram.

2 The Kernelized Taylor Diagram

Taylor Diagram. The TD was introduced as a tool that could relate several statistical quantities in a single figure [10]. Its strength lies in the ability to compare numerous data points where it would otherwise be necessary to utilize several figures and/or tables. The theoretical starting point of the TD is the Pearson correlation coefficient \(\rho \) and the root-mean-squared-error E between two data points. [10] argued that neither is sufficient to capture potential similarities on its own, but that in aggregate they are capable of detecting a wide range of differences between data points. Let \(\textbf{x}\) and \(\textbf{y}\) be two D-dimensional vectors representing two data points. The correlation coefficient between \(\textbf{x}\) and \(\textbf{y}\) is defined as:

$$\begin{aligned} \rho = \frac{1}{D}\sum \limits _{d=1}^D\frac{(x_d-\bar{x})(y_d-\bar{y})}{\sigma _x\sigma _y}, \end{aligned}$$
(1)

where \(\bar{x}\) and \(\bar{y}\) are the mean values and \(\sigma _x\) and \(\sigma _y\) are the standard deviations. The root-mean-squared-error for mean centered data points is defined as:

$$\begin{aligned} E^2&= \frac{1}{D}\sum \limits _{d=1}^D\Big ((x_d-\bar{x}) - (y_d-\bar{y})\Big )^2 \nonumber \\&=\underbrace{\frac{1}{D}\sum \limits _{d=1}^D(x_d-\bar{x})^2}_{\sigma _x^2}+\underbrace{\frac{1}{D}\sum \limits _{d=1}^D(y_d-\bar{y})^2}_{\sigma _y^2}-2\underbrace{\frac{1}{D}\sum \limits _{d=1}^D(x_d-\bar{x})(y_d-\bar{y})}_{\sigma _{xy}}\nonumber \\&= \sigma _x^2+\sigma _y^2-2\sigma _x\sigma _y\rho . \end{aligned}$$
(2)

The key point of the TD is to recognize the relationship between the statistical quantities in Eq. 2 and the law of cosines:

$$\begin{aligned} c^2 = a^2+b^2-2ab\cos (\theta ). \end{aligned}$$
(3)

Here, a and b are the lengths of two sides of a triangle with angle \(\theta \) between them and an opposite side of length c. The TD has seen widespread use in several domains, such as the geophysical sciences [6, 8]. Nevertheless, the TD has some key weaknesses that limit its functionality in many practical applications. The Pearson correlation coefficient has a number of limitations [1]. It can only model linear relationships [2], which is restrictive in many practical applications. Also, the Pearson correlation coefficient is known to be sensitive to outliers [1].
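Before turning to the kernelized diagram, the identity behind the TD can be checked numerically. The following sketch (our illustration, assuming NumPy and arbitrarily chosen samples) computes the quantities in Eqs. 1 and 2 and verifies the law-of-cosines form:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + 0.1 * rng.normal(size=1000)

# Standard deviations and Pearson correlation (Eq. 1).
sigma_x, sigma_y = x.std(), y.std()
rho = np.mean((x - x.mean()) * (y - y.mean())) / (sigma_x * sigma_y)

# Squared centered RMSE (left-hand side of Eq. 2).
E2 = np.mean(((x - x.mean()) - (y - y.mean())) ** 2)

# Law-of-cosines form (right-hand side of Eq. 2, cf. Eq. 3).
E2_cosine = sigma_x ** 2 + sigma_y ** 2 - 2 * sigma_x * sigma_y * rho

print(np.isclose(E2, E2_cosine))  # True
```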

The Kernelized Taylor Diagram. To address these limitations, we propose the KTD, which uses well-known measures from the kernel literature to model similarities between populations. The starting point of the KTD is one of the most widely used distance measures in the kernel literature, namely the maximum mean discrepancy (MMD) [7], which measures the distance between two distributions, where each distribution is represented by a mean embedding of the data. Let \(X\sim P\) and \(Y\sim Q\), and let \(\boldsymbol{\mu }_x\) and \(\boldsymbol{\mu }_y\) denote the mean embedding vectors representing the two distributions P and Q. Then, the MMD is defined as the norm of the difference between the two embeddings in a reproducing kernel Hilbert space \(\mathcal {H}\):

$$\begin{aligned} \begin{aligned} MMD^2&= \Vert \boldsymbol{\mu }_x - \boldsymbol{\mu }_y\Vert _\mathcal {H}^2 \\&= \Vert \boldsymbol{\mu }_x\Vert _\mathcal {H}^2 + \Vert \boldsymbol{\mu }_y\Vert _\mathcal {H}^2 - 2 \langle \boldsymbol{\mu }_x, \boldsymbol{\mu }_y \rangle _\mathcal {H} \\&= \Vert \boldsymbol{\mu }_x\Vert _\mathcal {H}^2 + \Vert \boldsymbol{\mu }_y\Vert _\mathcal {H}^2 - 2 \Vert \boldsymbol{\mu }_x\Vert _\mathcal {H}\Vert \boldsymbol{\mu }_y\Vert _\mathcal {H} \frac{\langle \boldsymbol{\mu }_x, \boldsymbol{\mu }_y \rangle _\mathcal {H}}{\Vert \boldsymbol{\mu }_x\Vert _\mathcal {H}\Vert \boldsymbol{\mu }_y\Vert _\mathcal {H}} \\&= \Vert \boldsymbol{\mu }_x\Vert _\mathcal {H}^2 + \Vert \boldsymbol{\mu }_y\Vert _\mathcal {H}^2 - 2 \Vert \boldsymbol{\mu }_x\Vert _\mathcal {H}\Vert \boldsymbol{\mu }_y\Vert _\mathcal {H} \cos \angle (\boldsymbol{\mu }_x, \boldsymbol{\mu }_y). \end{aligned} \end{aligned}$$
(4)

In general, the true data distributions are not known, so the mean embeddings are replaced by empirical mean embeddings that are estimated based on samples from each distribution:

$$\begin{aligned} \hat{\boldsymbol{\mu }}_x = \frac{1}{N}\sum \limits _{n=1}^N \kappa (\textbf{x}_n, \cdot ), \end{aligned}$$
(5)

where \(\kappa (\cdot , \cdot )\) is a positive definite kernel that measures similarity between data points. If the kernel is characteristic [7], the MMD is a metric and is zero if and only if the two distributions are equal. [5] showed that the well-known Gaussian kernel with kernel width \(\sigma \), \(G_\sigma (\textbf{x}_i, \textbf{x}_j)=\exp (-\Vert \textbf{x}_i-\textbf{x}_j\Vert ^2/(2\sigma ^2))\), is a characteristic kernel. Furthermore, the MMD does not assume a particular distribution of the data and can capture both linear and non-linear relationships between distributions.
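A minimal sketch of how the squared MMD in Eq. 4 can be estimated from samples with a Gaussian kernel is given below (a biased V-statistic estimate; the helper names, sample sizes, and the choice of \(\sigma \) are our assumptions, not taken from the paper's released code):

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Gaussian kernel matrix G[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    d2 = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of MMD^2 = ||mu_x||^2 + ||mu_y||^2 - 2 <mu_x, mu_y> (Eq. 4)."""
    return (gaussian_gram(x, x, sigma).mean()
            + gaussian_gram(y, y, sigma).mean()
            - 2 * gaussian_gram(x, y, sigma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=(500, 1))
y = rng.normal(1, 1, size=(500, 1))
print(mmd2(x, y))  # > 0, since the two samples come from different distributions
```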

As with the TD, we recognize the law of cosines in Eq. 4. The norms of the mean embeddings of the two distributions are the lengths of two sides of a triangle with angle \(\angle (\boldsymbol{\mu }_x, \boldsymbol{\mu }_y)\) between them and an opposite side with length equal to the MMD between the distributions. The KTD is shown in Fig. 1.
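In practice, each point in the KTD can be placed using two numbers computed from kernel matrices: the radius is the norm of the empirical mean embedding, and the angle is the arccosine of the normalized inner product with the reference embedding. A sketch under these assumptions (function names are hypothetical):

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Gaussian kernel matrix, as in the previous sketch."""
    d2 = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma ** 2))

def ktd_coordinates(x, ref, sigma=1.0):
    """Polar coordinates of population x relative to the reference population in the KTD."""
    norm_x = np.sqrt(gaussian_gram(x, x, sigma).mean())        # ||mu_x||
    norm_ref = np.sqrt(gaussian_gram(ref, ref, sigma).mean())  # ||mu_ref||
    inner = gaussian_gram(x, ref, sigma).mean()                # <mu_x, mu_ref>
    angle = np.arccos(np.clip(inner / (norm_x * norm_ref), -1.0, 1.0))
    return norm_x, angle  # the reference itself sits at angle 0 with radius ||mu_ref||
```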

The lengths of the mean embeddings give the distance from the origin to each point in the KTD. For the Gaussian kernel, the kernel mean embedding captures all moments of the data population [9]. However, it is not immediately obvious how to interpret the information that the kernel mean embeddings convey in the diagram. The kernel mean embeddings can nevertheless be related to uncertainty through the information potential (IP) from information theoretic learning [11], which allows for a similar interpretation of the KTD as of the TD. That is, the norms of the kernel mean embeddings play the role of the standard deviations \(\sigma _x\) and \(\sigma _y\) in Eq. 2. In most applications, the IP must be estimated from data. In information theoretic learning, the IP is often estimated through the quadratic IP estimator using a Gaussian kernel [11]:

$$\begin{aligned} \hat{V}_{2, \sigma }(X) = \frac{1}{N^2}\sum \limits _{i, j}^N G_\sigma (\textbf{x}_i, \textbf{x}_j). \end{aligned}$$
(6)

Next, the squared norm terms in Eq. 4 can be estimated empirically as:

$$\begin{aligned} \Vert \hat{\boldsymbol{\mu }}_x\Vert _\mathcal {H}^2 = \frac{1}{N^2}\sum \limits _{i, j}^N \kappa (\textbf{x}_i, \textbf{x}_j). \end{aligned}$$
(7)

If the mean embeddings are calculated using a Gaussian kernel, Eq. 6 and Eq. 7 are equivalent. Furthermore, the IP is related to entropy as follows:

$$\begin{aligned} \hat{H}_2(X) = -\log (\hat{V}_{2, \sigma }(X)). \end{aligned}$$
(8)

Entropy measures the amount of information in a random variable, but can also be interpreted as a measure of uncertainty. High entropy indicates more variation in the data, while low entropy means that the data are clustered together. From Eq. 8 it is evident that when the information potential of X is high, the entropy is low, and vice versa. For the KTD, this means that a random variable with a large kernel mean embedding norm, and thus far from the origin, is associated with low uncertainty, and conversely for a small kernel mean embedding norm. This insight is important, as it allows us to relate concepts from the TD to the KTD.
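The sketch below (our illustration) estimates the quadratic IP of Eq. 6, which for the Gaussian kernel coincides with the squared embedding norm of Eq. 7, and the entropy of Eq. 8; it shows that a tightly clustered sample has a higher IP and lower entropy than a widely spread one:

```python
import numpy as np

def information_potential(x, sigma=1.0):
    """Quadratic IP estimator (Eq. 6); equals the squared embedding norm (Eq. 7) for a Gaussian kernel."""
    d2 = np.sum(x ** 2, 1)[:, None] + np.sum(x ** 2, 1)[None, :] - 2 * x @ x.T
    return np.exp(-d2 / (2 * sigma ** 2)).mean()

def renyi_entropy(x, sigma=1.0):
    """Renyi quadratic entropy estimate (Eq. 8)."""
    return -np.log(information_potential(x, sigma))

rng = np.random.default_rng(0)
clustered = rng.normal(0, 0.1, size=(500, 1))  # tightly clustered sample
spread = rng.normal(0, 2.0, size=(500, 1))     # widely spread sample
# High IP goes with low entropy and vice versa.
print(renyi_entropy(clustered) < renyi_entropy(spread))  # True
```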

Fig. 2. Comparison of the TD with the KTD on the data described in Sect. 3. The experiment illustrates how the TD is not able to capture non-linear dependencies and is sensitive to outliers, in contrast to the proposed KTD.

3 Experiments

To illustrate the functionality of the KTD we consider the case where the true distribution of the data is known and generate 1000 samples from 5 different populations. The reference distribution \(X_{\text {ref}}\) is sampled from a standard normal distribution. The remaining populations are constructed as follows:

$$\begin{aligned} X_1&\sim 2X_{\text {ref}}+\epsilon , \quad X_2 \sim \frac{X_{\text {ref}}}{2}+\epsilon , \quad X_3 \sim X_{\text {ref}}^2+\epsilon , \\ X_4&\sim X_{\text {ref}}\sin (X_{\text {ref}})+\epsilon , \quad X_O \sim \frac{X_{\text {ref}}}{2}+\epsilon \text { (with outliers),} \end{aligned}$$

where \(\epsilon \sim \mathcal {N}(0, 0.01)\). Populations \(X_1\) and \(X_2\) are chosen to represent a linear relationship with the reference distribution, but with different scaling such that their standard deviations differ from that of the reference. Populations \(X_3\) and \(X_4\) are chosen to represent a non-linear relationship with the reference. Lastly, \(X_O\) is chosen to also have a linear relationship with the reference, but with two outliers added to the population. These two outliers are sampled from \(\mathcal {N}(10, 1)\). A sketch of how such populations could be generated is shown below.
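The sketch is our reading of the description above; the random seed, the interpretation of 0.01 as the noise standard deviation, and the placement of the two outliers are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

x_ref = rng.normal(0, 1, size=N)            # reference population, standard normal
eps = lambda: rng.normal(0, 0.01, size=N)   # additive noise; 0.01 read as the std here

x1 = 2 * x_ref + eps()                      # linear, larger scale
x2 = x_ref / 2 + eps()                      # linear, smaller scale
x3 = x_ref ** 2 + eps()                     # non-linear
x4 = x_ref * np.sin(x_ref) + eps()          # non-linear
x_o = x2.copy()                             # almost identical to x2 ...
x_o[:2] = rng.normal(10, 1, size=2)         # ... except for two outliers from N(10, 1)
```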

Figure 2a displays the TD for these populations in relation to the reference distribution, while Fig. 2b shows the KTD. First, we consider Fig. 2a. Note that \(X_1\) and \(X_2\) both have a high similarity with the reference but lie at different distances from the origin as a result of the difference in standard deviation. Next, both \(X_3\) and \(X_4\) are indicated as having low similarity with the reference, which is expected since the relationship is non-linear. Lastly, \(X_O\), which is almost identical to \(X_2\) except for two outliers, shows a much lower similarity score. This illustrates how sensitive the TD can be to outliers.

In Fig. 2b, \(X_1\) and \(X_2\) also show a similar and high similarity score. However, note that compared to Fig. 2a, the distances to the origin have changed, which is explained through the connection to the information potential described in Sect. 2. Next, both \(X_3\) and \(X_4\) are now indicated to have a high similarity with the reference, which illustrates that the KTD is capable of capturing non-linear similarities. Lastly, \(X_2\) and \(X_O\) are located at almost the same point in the diagram, which shows that the KTD is robust against outliers in the data.

4 Conclusion

In this article we proposed the KTD, which relates well-known quantities from the kernel literature in a single diagram. To the best of our knowledge, such a diagram has not been devised previously. Our proposed diagram addresses some key limitations of the widely used TD, namely the inability to model non-linear relationships and the sensitivity to outliers in the data. In future work, we intend to examine the usability of the diagram on real-world data, such as in climate applications. We believe that the KTD can be a useful tool in many machine learning applications.