On Distance Mapping from Non-Euclidean Spaces to Euclidean Spaces
Abstract
Most machine learning techniques traditionally rely on some form of Euclidean distance, computed in a Euclidean space (typically \(\mathbb {R}^{d}\)). In more general cases, data might not live in a classical Euclidean space, and it can be difficult (or impossible) to find a direct representation for it in \(\mathbb {R}^{d}\). A distance mapping from a non-Euclidean space to a canonical Euclidean space is therefore essential. We present in this paper a possible distance-mapping algorithm, such that the behavior of the pairwise distances in the mapped Euclidean space is preserved, compared to those in the original non-Euclidean space. Experimental results of the mapping algorithm are discussed on a specific type of dataset made of timestamped GPS coordinates. The comparison of the original and mapped distances, as well as the standard errors of the mapped distributions, are discussed.
1 Introduction
Traditionally, most data mining and machine learning methods have relied on the classical Euclidean distance (the Minkowski distance with exponent 2), or variations thereof, over data that lies in \(\mathbb {R}^{d}\) (with \(d\in \mathbb {N}_{+}\)). One problem with this approach is that it forces the data provider to process the original, raw data so that it fits in \(\mathbb {R}^{d}\), even when it is not natural to do so. For example, encoding arbitrary attributes (such as words from a specific, finite set) as integers creates an underlying order between the elements, which has to be handled carefully. It also imposes certain distances between the elements, and the possible choices in the conversion process are practically unlimited and difficult to justify. We therefore look here into the possibility of not converting the data to a Euclidean space, and of retaining the data in its original space, provided that we have a distance function between its elements. In effect, we “only” require and concern ourselves, in this paper, with metric spaces, over potentially non-Euclidean spaces. Since most data mining and machine learning methods rely on Euclidean distances and their properties, we want to verify that, in such a case, the distances in a non-Euclidean metric space can behave closely enough to Euclidean distances over a Euclidean space. The reasoning behind this is that traditional machine learning and data mining algorithms might expect the distances between elements to behave in a certain manner, and to respect certain properties, which we attempt to mimic with the following distance-mapping approach.
A distance-mapping algorithm, in the context of this paper, is an approach that takes a set of objects together with a pairwise distance function in a specific non-Euclidean space (that is, a non-Euclidean metric space), and maps those pairwise distances to a canonical Euclidean space, in such a way that the distance distribution among the objects is approximately preserved in the mapped Euclidean space.
Distance-mapping algorithms are a useful tool in data applications such as clustering and visualisation. Another good use case is mutual information [3, 7, 10], which quantitatively measures the mutual dependence between two (or more) sets of random variables in information theory. Mutual information is most often estimated by constructing k-nearest-neighbor graphs [7, 10] over the underlying data, which rely on Euclidean distances. Hence, there is a strong need to recalculate the distances over a potentially non-Euclidean space.
In the following Sect. 2, we first introduce the basic notation used to describe the distance-mapping approach in Sects. 3, 4 and 5. Section 6 presents the machine learning model used for the mapping. After a short discussion of important implementation details in Sect. 7, we finally present and discuss results over a synthetic dataset in Sect. 8.
2 Notations
As in the data privacy literature, one traditionally defines a dataset of N records by \(\varvec{X} = [\varvec{x_1},...,\varvec{x_N}]^{T}\), the matrix of N samples (records) with d attributes \(\{\varvec{A^{(1)}},...,\varvec{A^{(d)}}\}\). A record \(\varvec{x_l}\) is then defined as \(\varvec{x_l} = [a_l^{(1)},a_l^{(2)},...,a_l^{(d)} ] \), \(a_l^{(j)} \in \mathbb {X}^{(j)}\), where \(\mathbb {X}^{(j)}\) is the set of all the possible values for a certain attribute \(\varvec{A^{(j)}}\). Hence, we can see the vector \([a_1^{(j)}, a_2^{(j)}, ..., a_N^{(j)}]^T \in (\mathbb {X}^{(j)})^N \) as the realisations of a discrete random variable for a certain attribute over all the N samples.
Let us consider a metric space \(\mathcal {X}^{(j)}=(\mathbb {X}^{(j)}, d^{(j)})\) using the set \(\mathbb {X}^{(j)}\) explained above, endowed with the distance function \(d^{(j)}: \mathbb {X}^{(j)}\times \mathbb {X}^{(j)}\longrightarrow \mathbb {R}_{+}\). Generally, \(\mathcal {X}^{(j)}\) need not be a Euclidean metric space.
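As a concrete illustration (our own, not from the paper), a finite set of words endowed with the Levenshtein (edit) distance forms such a non-Euclidean metric space \(\mathcal {X}^{(j)}\); the sketch below verifies the metric axioms on a small set:

```python
import itertools

# A hypothetical non-Euclidean metric space X^(j): a finite set of words
# endowed with the Levenshtein (edit) distance as d^(j).
def edit_distance(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

words = ["cat", "cart", "dog", "card"]
# check symmetry and the triangle inequality over the whole set
for x, y, z in itertools.product(words, repeat=3):
    assert edit_distance(x, y) == edit_distance(y, x)
    assert edit_distance(x, z) <= edit_distance(x, y) + edit_distance(y, z)
print(edit_distance("cat", "cart"))  # 1
```

No embedding of these words in \(\mathbb {R}^{d}\) is assumed; only the pairwise distances are used, which is exactly the setting of this paper.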
3 Distances over Non-Euclidean Spaces
Now we consider two metric spaces \(\mathcal {X}^{(i)}=(\mathbb {X}^{(i)}, d^{(i)})\) and \(\mathcal {X}^{(j)}=(\mathbb {X}^{(j)}, d^{(j)})\). Let us assume \(\mathcal {X}^{(i)}\) to be a canonical Euclidean space with the distance function \(d^{(i)}\) the Euclidean norm and \(\mathbb {X}^{(i)}=\mathbb {R}^{d}\), while \(\mathcal {X}^{(j)}\) is a non-Euclidean space endowed with a non-Euclidean distance function \(d^{(j)}\).
We denote by \(f_{\varvec{z^{(j)}}}(d)\) the probability density function (PDF) of \(\varvec{z^{(j)}}\), which describes the pairwise distance distribution over the non-Euclidean metric space \(\mathcal {X}^{(j)}\). In the same way, we define \(f_{\varvec{z^{(i)}}}(d)\) to be the distribution of pairwise distances over the Euclidean metric space \(\mathcal {X}^{(i)}\).
As we are using a limited number N of realisations of the random variables to estimate the distribution \(f_{\varvec{z^{(j)}}}(d)\), the limit over N rests on the assumption that we can “afford” to draw a sufficiently large number N of samples so that the estimate of \(f_{\varvec{z^{(j)}}}(d)\) can be brought close enough to \(f_{\varvec{z^{(i)}}}(d)\). We present the mapping approach used in this paper in the following Sect. 4, by solving an integral equation so as to obtain equal probability masses.
4 Mapping Solution
We propose to use machine learning (more specifically, universal function approximators [4]) to map the distribution \(f_{\varvec{z^{(j)}}}\) of the non-Euclidean distances to the distribution \(f_{\varvec{z^{(i)}}}\) of the Euclidean distances, given that most machine learning techniques are able to fit a continuous input to a different continuous output.
The following Sect. 5 describes in practice the algorithm used to achieve the proposed distance mapping.
5 Algorithm of Distance Mapping
To calculate the integral on the left-hand side of Eq. 3, we first need to construct the distribution function \(f_{\varvec{z^{(j)}}}\) using machine learning for functional estimation. The distance-mapping algorithm proceeds as follows:
A.1 Draw as many samples as possible from \(\varvec{x}^{(j)}\) and \(\varvec{y}^{(j)}\) (random variables over \(\mathbb {X}^{(j)}\));
A.2 Compute \(\varvec{z}^{(j)}=d^{(j)}(\varvec{x}^{(j)},\varvec{y}^{(j)})\);
A.3 Compute the histogram of \(\varvec{z}^{(j)}\);
A.4 Use a machine learning algorithm to learn this histogram: this creates an unnormalized version of \(f_{\varvec{z}^{(j)}}(t)\);
A.5 Integrate \(f_{\varvec{z}^{(j)}}(t)\) over its domain to obtain the normalizing constant \(\varvec{C}^{(j)}\);
A.6 Normalize the estimated function from A.4 with the constant \(\varvec{C}^{(j)}\);
A.7 This yields a functional representation \(g^{(j)}(t)\) of \(f_{\varvec{z}^{(j)}}(t)\) that behaves as an estimate of the PDF of \(\varvec{z}^{(j)}\);
A.8 Integrate \(g^{(j)}(t)\) from 0 to z (the given distance value) to obtain a value we denote \(\beta = \int _{0}^{z} g^{(j)}(t)\,dt\);
A.9 Let \(F_{\varvec{z}^{(i)}}(\alpha )\) be the cumulative distribution function (CDF) of the Euclidean distances \(\alpha = d^{(i)}(x,y)\). Solving Eq. 3 then reduces to finding \(\alpha \) such that \(F_{\varvec{z}^{(i)}}(\alpha ) = \beta \), i.e. \(\alpha = F_{\varvec{z}^{(i)}}^{-1}(\beta )\).
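The steps A.1–A.9 above can be sketched in Python as follows. This is an illustration only: the empirical CDF stands in for the trained function approximator of steps A.3–A.7, and both the non-Euclidean metric (arc length on a circle) and the target Euclidean distribution (uniform on [0, 1]) are hypothetical choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-Euclidean distance d^(j): arc length on the unit circle.
def d_j(x, y):
    diff = np.abs(x - y) % (2 * np.pi)
    return np.minimum(diff, 2 * np.pi - diff)

# A.1-A.2: draw samples and compute the distances z^(j)
x = rng.uniform(0, 2 * np.pi, 2000)
y = rng.uniform(0, 2 * np.pi, 2000)
z = d_j(x, y)

# A.3-A.7: the empirical CDF plays the role of the normalized functional
# estimate g^(j); in the paper an ELM fitted to the histogram is used.
z_sorted = np.sort(z)

def beta_of(dist):
    # A.8: probability mass below `dist` (integral of g^(j) from 0 to dist)
    return np.searchsorted(z_sorted, dist) / len(z_sorted)

# A.9: map beta through the inverse target CDF F_i^{-1}; for a uniform
# target on [0, 1], F_i(a) = a, so alpha = beta.
def mapped_alpha(dist):
    return beta_of(dist)

print(mapped_alpha(np.pi / 2))  # close to 0.5: arc lengths are uniform on [0, pi]
```

Note that only the pairwise distances enter the procedure; no coordinates in \(\mathbb {R}^{d}\) are ever needed.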
Note that this algorithm is independent of the nature of \(\mathcal {X}^{(i)}\): at this point, \(\mathcal {X}^{(i)}\) can be any metric space. In the following, we look at the two possibilities of mapping a non-Euclidean space either to a Euclidean space, or to another non-Euclidean space (for completeness’ sake).

1. \(\mathcal {X}^{(i)}=(\mathbb {X}^{(i)},d^{(i)})\) is the canonical Euclidean space, i.e. \(\mathbb {X}^{(i)}=\mathbb {R}\) and \(d^{(i)}\) is the Euclidean distance over \(\mathbb {R}\);
2. \(\mathcal {X}^{(i)}\) is not the canonical Euclidean space: the set of all necessary values \(\mathbb {X}^{(i)}\) does not have to be \(\mathbb {R}\), while \(d^{(i)}\) is the Euclidean distance.
5.1 First Case
In the case of \(\mathcal {X}^{(i)}\) being the canonical Euclidean space, we can find analytical expressions for \(f_{\varvec{z}^{(i)}}\) in Eq. 3 by making assumptions on how the variables \(\varvec{x}^{(i)}\) and \(\varvec{y}^{(i)}\) are distributed. If such assumptions are not acceptable for some reason, it is always possible to revert to the estimation approach mentioned above, or possibly to solve analytically, as below, for other well-known distributions.
Suppose \(\varvec{x}^{(i)}\) and \(\varvec{y}^{(i)}\) are normally distributed: we assume that \(\varvec{x}^{(i)}\) and \(\varvec{y}^{(i)}\) are i.i.d. and follow a normal distribution \(\mathcal {N}(\mu ,\sigma ^2)\) with mean \(\mu \) and variance \(\sigma ^2\).
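Under this i.i.d. normal assumption, \(\varvec{x}^{(i)}-\varvec{y}^{(i)} \sim \mathcal {N}(0, 2\sigma ^2)\), so the pairwise distance \(\varvec{z}^{(i)} = |\varvec{x}^{(i)}-\varvec{y}^{(i)}|\) over \(\mathbb {R}\) is half-normal, with CDF \(F_{\varvec{z}^{(i)}}(z) = \mathrm {erf}(z/(2\sigma ))\). A small numerical check of this closed form (our own sketch, not part of the original experiments):

```python
import numpy as np
from math import erf

# If x, y ~ N(mu, sigma^2) are i.i.d., then x - y ~ N(0, 2 sigma^2), so the
# pairwise distance z = |x - y| has the half-normal CDF erf(z / (2 sigma)).
sigma = 1.5

def F_z(z, sigma):
    return erf(z / (2 * sigma))

# compare the closed form against a Monte Carlo simulation
rng = np.random.default_rng(1)
x = rng.normal(0.0, sigma, 100_000)
y = rng.normal(0.0, sigma, 100_000)
z = np.abs(x - y)
empirical = np.mean(z <= 2.0)
print(abs(empirical - F_z(2.0, sigma)) < 0.01)  # True
```

The mean \(\mu \) cancels in the difference, which is why it does not appear in the CDF.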
5.2 Second Case
6 Using ELM to Learn the Functional Distribution
We propose to use Extreme Learning Machines (ELM) [5, 6] as the mapping tool between distance functions. The reason for choosing this specific machine learning technique is its excellent ratio of performance to computational time. The model is simple and involves a minimal amount of computation. Since we are dealing with the limit problem on the number of records N needed to estimate the distribution \(f_{\varvec{z}^{(j)}}(d)\) (in Eq. 2), the ELM model is applicable in that it can learn the mapping in reasonable time for large amounts of data, if such a need arises. ELM is a universal function approximator, which can fit any continuous function.
The ELM algorithm was originally proposed by Guang-Bin Huang et al. in [6], further developed, e.g., in [8, 9, 12], and analysed in [2]. It uses the structure of a Single Layer Feedforward Neural Network (SLFN) [1]. The main concept behind the ELM approach is the random initialization of its hidden layer, instead of a computationally costly training procedure. Only the output weight matrix, between the hidden representation of the inputs and the outputs, remains to be found.
The ELM training does not require iterations, so the most computationally costly part is the calculation of a pseudoinverse of the matrix \(\varvec{H}\). This makes ELM an extremely fast Machine Learning method. Thus, we propose to use ELM to learn the distribution of the pairwise distances over nonEuclidean spaces \(f_{\varvec{z}^{(j)}}(t)\) or \(F_{\varvec{z}^{(j)}}(t)\).
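As a minimal sketch of the idea (our own simplified implementation, with names of our choosing, not the authors’ code), an ELM regressor reduces to a random hidden layer followed by one pseudoinverse:

```python
import numpy as np

# Minimal ELM regressor sketch: random input weights and biases, tanh
# hidden layer, output weights solved by least squares via pinv(H).
class ELM:
    def __init__(self, n_hidden=40, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(1, n_hidden))   # random input weights (fixed)
        self.b = rng.normal(size=n_hidden)        # random biases (fixed)
        self.beta = None

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)       # hidden representation H

    def fit(self, X, y):
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ y         # the only costly step
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Learn a toy 1-D distance distribution (stand-in for F_z(j)(t))
t = np.linspace(0, 1, 200)[:, None]
F = t.ravel() ** 2
model = ELM().fit(t, F)
print(np.max(np.abs(model.predict(t) - F)))  # small approximation error
```

No iterative optimization occurs in `fit`, which is what makes the training so fast.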
7 Implementation Improvement
When the mapping solution is implemented straightforwardly as in Sect. 5, the algorithm spends most of the CPU time on numerically calculating the integral of \(f_{\varvec{z}^{(j)}}(t)\) over the distances \(\varvec{z}^{(j)}\), as in Eq. 4. This is because the number of pairwise distances \(\varvec{z}^{(j)}\) is \(N(N-1)/2\), which obviously grows very large as the data size N increases. We therefore avoid the integration by using machine learning to learn the CDF \(F_{\varvec{z}^{(j)}}\) directly, instead of learning the PDF \(f_{\varvec{z}^{(j)}}\) in A.4. This yields a functional representation of \(F_{\varvec{z}^{(j)}}(t)\) (with the normalisation constant obtained directly from \(F_{\varvec{z}^{(j)}}\)), and \(\beta \) can then be read straight from \(F_{\varvec{z}^{(j)}}(z)\).
The second most CPU-consuming step in this algorithm is finding the best-suited distribution \(f_{\varvec{z}^{(i)}}(t)\) or \(F_{\varvec{z}^{(i)}}(t)\) in the Euclidean space \(\mathcal {X}^{(i)}\). To choose whether the non-Euclidean distribution is best mapped to a Normal, Uniform, Rayleigh, or other distribution, we fit \(F_{\varvec{z}^{(j)}}(t)\) to those well-defined canonical Euclidean distance distributions and find the optimised parameters of the most suitable distribution, i.e. the one with the smallest error.
Again, if we use the pairwise distances \(\varvec{z}^{(j)}\) in \(F_{\varvec{z}^{(j)}}\) directly, the fitting computation is very heavy, as we are trying to fit \(N(N-1)/2\) points. To ease this, we use the functional representation of \(F_{\varvec{z}^{(j)}}(t)\) evaluated at user-defined distances in a predefined domain, with the sole purpose of finding the best distribution and its parameters (the functional representation of \(F_{\varvec{z}^{(i)}}(t)\)). The mapped distance \(\alpha \) can then be obtained from Eq. 6, using the calculated \(\beta \) and the inverse functional representation of \(F_{\varvec{z}^{(i)}}(t)\).
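The fitting step can be sketched as follows (a simplified illustration of ours, with moment-matched parameters and synthetic data, not the paper’s implementation): the empirical CDF is evaluated on a small grid, and each candidate canonical CDF is scored by its maximum deviation on that grid:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.rayleigh(2.0, 5000)          # stand-in for the N(N-1)/2 pairwise distances

# Evaluate the empirical CDF once on a small user-defined grid, instead of
# fitting against all N(N-1)/2 points directly.
z_sorted = np.sort(z)
grid = np.linspace(0.0, z_sorted[-1], 100)
F_grid = np.searchsorted(z_sorted, grid) / len(z_sorted)

# Candidate canonical CDFs with moment-matched parameters
sigma_r = np.sqrt(np.mean(z ** 2) / 2)            # Rayleigh scale estimate
cdf_rayleigh = 1 - np.exp(-grid ** 2 / (2 * sigma_r ** 2))
cdf_uniform = np.clip(grid / z_sorted[-1], 0, 1)  # uniform on [0, max]

# Score each candidate by its maximum deviation and keep the best one
err_r = np.max(np.abs(cdf_rayleigh - F_grid))
err_u = np.max(np.abs(cdf_uniform - F_grid))
print(err_r < err_u)  # True: the Rayleigh fit wins on Rayleigh-distributed data
```

The grid size (100 points here) is fixed regardless of N, which is what removes the quadratic cost.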
In the following Sect. 8, we present results over the typical data used for this work, GPS traces (latitude and longitude).
8 Experimental Results
Note that the metric space of GPS coordinates \(\mathcal {X}^{(\varvec{gps})}=(\mathbb {X}^{(\varvec{gps})}, d^{(\varvec{gps})})\) is a non-Euclidean space, because the distance \(d^{(\varvec{gps})}\) between two GPS coordinates \((\varvec{lat},\varvec{lon})\) is the shortest route between the two points on the Earth’s surface, namely a segment of a great circle.
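For reference, \(d^{(\varvec{gps})}\) can be computed with the haversine formula; the sketch below (with the spherical Earth radius and the example coordinates as our own illustrative choices) is one standard way to do so:

```python
import math

# Great-circle distance d^(gps) between two (lat, lon) points given in
# degrees, via the haversine formula; R is the mean Earth radius in km.
def great_circle_km(lat1, lon1, lat2, lon2, R=6371.0):
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# Helsinki to Stockholm is roughly 400 km along a great circle
print(great_circle_km(60.17, 24.94, 59.33, 18.07))
```

This distance satisfies the metric axioms but is not a Euclidean norm on the raw \((\varvec{lat},\varvec{lon})\) pairs, which is precisely why the mapping of Sect. 5 is needed here.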
We first explore the limit condition on the number of records N in Eq. 2, in that N needs to be sufficiently large to estimate \(f_{z^{(j)}}\) (or \(F_{z^{(j)}}\)) closely enough to \(f_{z^{(i)}}\) (or \(F_{z^{(i)}}\)). We test on experimental datasets with various \(N=10,30,100,1000\), in which each location record is chosen at random along the trajectory introduced in Fig. 1.
Figure 2 illustrates the comparisons of the CDF \(F_{\varvec{z}^{(j)}}(d)\) of the pairwise distances obtained from \(\mathcal {X}^{(\varvec{gps})}=(\mathbb {X}^{(\varvec{gps})}, d^{(\varvec{gps})})\), and the CDF \(F_{\varvec{z}^{(i)}}(d)\) of the mapped distances in the Euclidean space, with \(N=10,30,100,1000\) for the four subplots respectively.
It is clear that, in this specific simple case, with the small N values of 10 and 30, there are considerable disagreements between the CDF over \(\mathcal {X}^{(\varvec{gps})}\) and that over the mapped Euclidean space. Meanwhile, with the larger N values of 100 and 1000, the limit condition on N is well satisfied: the non-Euclidean GPS metric over \(\mathcal {X}^{(\varvec{gps})}\) behaves very closely to the mapped Euclidean metric over the mapped Euclidean space.
Thus, the number of records \(N=100\) is sufficient to estimate the distribution \(f_{\varvec{z^{(j)}}}(d)\) closely to \(f_{\varvec{z^{(i)}}}(d)\) in this very simple case. The Standard Errors (SE) of the mapped distribution \(f_{\varvec{z^{(i)}}}\) are also calculated, using denser and broader N values from 5 to 5000 along the specific route. Figure 3 shows the SEs of the mapped distribution as a function of N. As the latitude and longitude coordinates vary linearly in this simple case, the distances are mapped straightforwardly to a uniform distribution. The SE of the mapped \(f_{\varvec{z^{(i)}}}(d)\) converges close to 0 already at a very small \(N \simeq 50\).
9 Conclusion
We have developed and implemented a distance-mapping algorithm, which projects a non-Euclidean space onto a canonical Euclidean space in such a way that the pairwise distance distributions are approximately preserved. The mapping algorithm is based on the assumptions that both spaces are actual metric spaces, and that the number of records N is large enough to estimate the pairwise distance distribution \(f_{z^{(j)}}\) (or \(F_{z^{(j)}}\)) closely enough to \(f_{z^{(i)}}\) (or \(F_{z^{(i)}}\)). We have tested our algorithm by illustrating the distance mapping of an experimental dataset of GPS coordinates. The limit condition on N was discussed by comparing \(F_{z^{(j)}}\) and its mapped \(F_{z^{(i)}}\) for various values of N. The standard errors of the mapped distance distribution \(F_{z^{(i)}}\) were also analysed for various N.
Our distance-mapping algorithm has been tested with the most common canonical Euclidean distance distributions. Less common distributions certainly need to be implemented as well, and might require specific adjustments. More diversified experimental examples are also needed for completeness.
References
1. Auer, P., Burgsteiner, H., Maass, W.: A learning rule for very simple universal approximators consisting of a single layer of perceptrons. Neural Netw. 21(5), 786–795 (2008)
2. Cambria, E., Huang, G.B., Kasun, L.L.C., Zhou, H., Vong, C.M., Lin, J., Yin, J., Cai, Z., Liu, Q., Li, K., et al.: Extreme learning machines (trends & controversies). IEEE Intell. Syst. 28(6), 30–59 (2013)
3. Cover, T.M.: Elements of Information Theory. Wiley, New York (1991)
4. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. (MCSS) 2(4), 303–314 (1989)
5. Huang, G.B., Chen, L., Siew, C.K., et al.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006)
6. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1), 489–501 (2006)
7. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev. E 69(6), 066138 (2004)
8. Miche, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., Lendasse, A.: OP-ELM: optimally pruned extreme learning machine. IEEE Trans. Neural Netw. 21(1), 158–162 (2010)
9. Miche, Y., van Heeswijk, M., Bas, P., Simula, O., Lendasse, A.: TROP-ELM: a double-regularized ELM using LARS and Tikhonov regularization. Neurocomputing 74(16), 2413–2421 (2011)
10. Pál, D., Póczos, B., Szepesvári, C.: Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In: Advances in Neural Information Processing Systems, pp. 1849–1857 (2010)
11. Rao, C.R., Mitra, S.K.: Generalized Inverse of Matrices and Its Applications. Wiley, New York (1971)
12. van Heeswijk, M., Miche, Y., Oja, E., Lendasse, A.: GPU-accelerated and parallelized ELM ensembles for large-scale regression. Neurocomputing 74(16), 2430–2437 (2011)