This section reviews all unsupervised hubness reduction methods used in this paper. Some methods operate directly on vector data, others on distances between pairs of objects. Euclidean or cosine distances are usually used as input for the latter methods, although some also operate on non-metric dissimilarities or similarities. We use Euclidean distances as primary distances, unless \({{\mathrm{kNN}}}\)-classification with cosine distances yields significantly better results (McNemar test, not shown for brevity). Let \(B \subseteq \mathbb {R}^m\) be a non-empty data set with n data objects in m-dimensional space, that is, \(b_i = ( b_{i,1}, \dots , b_{i,m} ) \in B\) for \(i \in \{ 1,\dots ,n \}\). Let \(\varvec{x}\), \(\varvec{y}\), and \(\varvec{z}\) be shorthands for three m-dimensional numeric vectors \(b_x\), \(b_y\), and \(b_z\), respectively. Let \(d : B \times B \rightarrow \mathbb {R}\) be a measure of dissimilarity; the dissimilarity between two objects \(\varvec{x}\) and \(\varvec{y}\) is then denoted as \(d_{x,y}\). Most hubness reduction methods have tunable hyperparameters. We try to follow the notation of the original publications and thus reuse some symbols across methods, but only if their meanings are closely related. For example, k always refers to neighborhood size, though individual methods may use nearest neighbor information differently. Descriptions of all parameters follow in the next sections.
Measuring hubness
Before we introduce hubness reduction methods, we briefly describe the measures commonly used to quantify the degree of hubness in a data set.
k-occurrence
The k-occurrence \(O^k(x)\) of an object \(\varvec{x}\) is defined as the number of times \(\varvec{x}\) resides among the k nearest neighbors of all other objects in the data set. In terms of network analysis, \(O^k(x)\) is the indegree of \(\varvec{x}\) in a directed \({{\mathrm{kNN}}}\) graph. It is also known as the reverse neighbor count.
Hubness
Hubness is typically measured as the skewness of the k-occurrence distribution [44]:
$$\begin{aligned} S^k = \frac{\mathbb {E}[(O^k - \mu _{O^k})^3]}{\sigma _{O^k}^3}, \end{aligned}$$
(1)
where \(\mu _{O^k}\) and \(\sigma _{O^k}\) denote the mean and standard deviation of the k-occurrence distribution, respectively. Typical values of k used in the literature include 1, 5, 10, and 20. Previous research indicates that the choice of k is non-critical. For the real-world data sets used in this paper, we observe very high correlation of k-occurrence among various values of k (Fig. 1), except for \(k=1\), which is less correlated. We therefore deem any value of \(5 \le k \ll n\) suitable for the analysis of hubness reduction and use \(k=10\) for all hubness measurements in this paper.
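To make these measures concrete, the following minimal NumPy/SciPy sketch computes \(O^k\) and \(S^k\) from a precomputed \(n \times n\) matrix of primary distances. The function names are ours for illustration and are not taken from any particular toolbox API.

```python
import numpy as np
from scipy.stats import skew

def k_occurrence(D, k=10):
    """O^k: count how often each object appears among the k nearest
    neighbors of the other objects (indegree in the directed kNN graph)."""
    n = D.shape[0]
    D = D.copy()
    np.fill_diagonal(D, np.inf)            # exclude self-distances
    nn = np.argsort(D, axis=1)[:, :k]      # k nearest neighbors per object
    return np.bincount(nn.ravel(), minlength=n)

def hubness(D, k=10):
    """S^k: skewness of the k-occurrence distribution (Eq. 1)."""
    return skew(k_occurrence(D, k))
```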
Methods based on repairing asymmetric relations
The following methods aim at repairing asymmetric neighbor relations. All of them compute secondary distances by transforming the primary distances (for example, Euclidean or cosine) of a data set.
Local scaling and the non-iterative contextual dissimilarity measure
Local scaling (LS) was proposed to improve spectral clustering performance on data of multiple scales [64]. Pairwise secondary distances are calculated as:
$$\begin{aligned} {{\mathrm{LS}}}(d_{x,y}) = 1 - \exp \left( -\frac{d_{x,y}^2}{\sigma _x \sigma _y} \right) . \end{aligned}$$
(2)
The scaling parameter \(\sigma _x\) (\(\sigma _y\)) is set to the distance from object \(\varvec{x}\) (\(\varvec{y}\)) to its k-th nearest neighbor. LS induces increased symmetry in nearest neighbor relations by incorporating local distance information and was proposed for hubness reduction for that reason [49].
The non-iterative contextual dissimilarity measure (NICDM, [33]) is closely related to local scaling: The scaling factor of an object \(\varvec{x}\) is set to the mean distance to its k nearest neighbors (compared to using only the k-th neighbor in LS). We use NICDM transformations adapted for hubness reduction [49]:
$$\begin{aligned} {{\mathrm{NICDM}}}(d_{x,y}) = \frac{d_{x,y}}{\sqrt{\mu _x\,\mu _y}}, \end{aligned}$$
(3)
where \(\mu _x\) denotes the mean distance from object \(\varvec{x}\) to its k nearest neighbors (analogously for \(\mu _y\) and object \(\varvec{y}\)). The parameter k in both LS and NICDM should reflect the local neighborhood around each object and can be tuned to minimize hubness.
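A minimal sketch of both transformations (Eqs. 2 and 3) follows, assuming a symmetric primary distance matrix D; the function names are ours.

```python
import numpy as np

def local_scaling(D, k=10):
    """LS secondary distances (Eq. 2) from a primary distance matrix D."""
    D = np.array(D, dtype=float, copy=True)
    np.fill_diagonal(D, np.inf)
    # sigma_x: distance from each object to its k-th nearest neighbor
    sigma = np.sort(D, axis=1)[:, k - 1]
    np.fill_diagonal(D, 0.0)
    return 1.0 - np.exp(-D**2 / np.outer(sigma, sigma))

def nicdm(D, k=10):
    """NICDM secondary distances (Eq. 3)."""
    D = np.array(D, dtype=float, copy=True)
    np.fill_diagonal(D, np.inf)
    # mu_x: mean distance from each object to its k nearest neighbors
    mu = np.sort(D, axis=1)[:, :k].mean(axis=1)
    np.fill_diagonal(D, 0.0)
    return D / np.sqrt(np.outer(mu, mu))
```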
Global scaling: mutual proximity
While LS and NICDM use local distance statistics to enforce symmetric neighborhoods, mutual proximity (MP, [49]) incorporates information from all pairwise distances in the data set to achieve the same goal. Let X be a random variable of distances between \(\varvec{x}\) and all other objects in the data set (analogously for Y and \(\varvec{y}\)), and let P denote their joint probability distribution; then
$$\begin{aligned} {{\mathrm{MP}}}(d_{x,y}) = P \left( X > d_{x,y} \cap Y > d_{y,x} \right) . \end{aligned}$$
(4)
Secondary distances are calculated as the complement of the joint probability of two objects being nearest neighbors to each other (i.e., \(1 - {{\mathrm{MP}}}\)). To allow for this probabilistic view, MP models the distances \(d_{x,i}\), \(i \in \{1,\dots ,n\} \setminus \{x\}\), between an object \(\varvec{x}\) and all other objects with some distribution. When using the empirical distance distribution, mutual proximity between two objects \(\varvec{x}\) and \(\varvec{y}\) is calculated by counting the objects whose distances to both \(\varvec{x}\) and \(\varvec{y}\) are greater than \(d_{x,y}\):
$$\begin{aligned} {{\mathrm{MP}}}(d_{x,y}) = \frac{|\{ j : d_{x,j} > d_{x,y}\} \cap \{ j : d_{y,j} > d_{y,x}\}|}{n - 2}. \end{aligned}$$
(5)
Compared to the formula in Ref. [49], the denominator is reduced to account for identity distances. This influences the normalization to the [0, 1] range but does not change neighborhood order, hubness, or nearest neighbor classification.
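A direct, unoptimized sketch of the empirical variant (Eq. 5) follows, returning \(1 - {{\mathrm{MP}}}\) as secondary distances; the naive double loop is chosen for clarity, not speed, and the function name is ours.

```python
import numpy as np

def mutual_proximity_empirical(D):
    """Empirical MP (Eq. 5); returns secondary distances 1 - MP."""
    n = D.shape[0]
    mp = np.zeros_like(D, dtype=float)
    for x in range(n):
        for y in range(x + 1, n):
            # count objects j whose distances to both x and y exceed d_{x,y}
            shared = np.sum((D[x] > D[x, y]) & (D[y] > D[y, x]))
            mp[x, y] = mp[y, x] = shared / (n - 2)
    D_mp = 1.0 - mp
    np.fill_diagonal(D_mp, 0.0)        # self-distances remain zero
    return D_mp
```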
In the framework of MP, distances can also be modeled with any (continuous) distribution. This is especially useful when the user has prior knowledge of the given data domain. Additionally, if X and Y are assumed to be independent, Formula 4 simplifies to
$$\begin{aligned} {{\mathrm{MP}}}^\text {I}(d_{x,y}) = P(X > d_{x,y}) \cdot P(Y > d_{y,x}). \end{aligned}$$
(6)
These approximations simplify calculations and decrease the computational complexity of MP. The Gaussian-based mutual proximity variant (MP\(^\mathrm{GaussI}\)) models the distances of each object \(\varvec{x}\) to all other objects with a normal distribution (\(X \sim \mathcal {N}(\mu , \sigma ^2)\)). Parameters \(\mu _x\) and \(\sigma ^2_x\) can be estimated with the sample mean \(\hat{\mu }_x\) and variance \(\hat{\sigma }^2_x\):
$$\begin{aligned} \hat{\mu }_{x} = \frac{1}{n-1} \sum _{i=1, i \ne x}^{n} d_{x, i}, \qquad \hat{\sigma }_{x}^2 = \frac{1}{n-1} \sum _{i=1, i \ne x}^{n} (d_{x,i} - \hat{\mu }_{x})^2 \end{aligned}$$
(7)
Compared to Ref. [49], we exclude self-distances \(d_{x,x}\) from parameter estimation. This should improve the approximation, since self-distances are not informative. Secondary distances based on MP\(^\mathrm{GaussI}\) are calculated as
$$\begin{aligned} {{\mathrm{MP}}}^\text {GaussI}(d_{x,y}) = \hbox {SF}(d_{x,y}, \hat{\mu }_x, \hat{\sigma }^2_x) \cdot \hbox {SF}(d_{y,x}, \hat{\mu }_y, \hat{\sigma }^2_y), \end{aligned}$$
(8)
where \(\hbox {SF}(d, \mu , \sigma ^2) = 1 - \hbox {CDF}(d, \mu , \sigma ^2)\), that is, the survival function (complement of the cumulative distribution function) evaluated at value d for the indicated distribution.
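A sketch of MP\(^\mathrm{GaussI}\) (Eqs. 7 and 8) using the normal survival function from SciPy follows; self-distances are excluded from parameter estimation as described above, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def mutual_proximity_gaussi(D):
    """MP^GaussI (Eq. 8): model each object's distances as Gaussian and
    multiply the two survival functions; returns 1 - MP as distances."""
    n = D.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Eq. 7: per-object mean and standard deviation over the n-1 off-diagonal distances
    mu = np.array([D[i, off_diag[i]].mean() for i in range(n)])
    sd = np.array([D[i, off_diag[i]].std() for i in range(n)])
    # SF(d_{x,y}; mu_x, sigma_x) for every pair via row-wise broadcasting
    sf = norm.sf(D, loc=mu[:, None], scale=sd[:, None])
    D_mp = 1.0 - sf * sf.T                 # SF_x(d_xy) * SF_y(d_yx)
    np.fill_diagonal(D_mp, 0.0)
    return D_mp
```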
Shared nearest neighbors and simhub
A shared neighborhood is the intersection of the nearest neighbor sets of two objects [32]. Secondary distances based on shared nearest neighbors (SNN) increase pairwise stability and relation symmetry, which is considered beneficial for hubness reduction [20]. SNN similarities are calculated as:
$$\begin{aligned} {{\mathrm{SNN}}}(x,y) = \frac{| {{\mathrm{kNN}}}(x) \cap {{\mathrm{kNN}}}(y) |}{k}, \end{aligned}$$
(9)
where \({{\mathrm{kNN}}}(\cdot )\) is the set of the k-nearest neighbors of some object.
Simhub [59] is a shared-neighbors approach that weights shared neighbors \(\varvec{z}\) by informativeness (increasing the weight of rare neighbors) and purity (penalizing neighborhoods with inconsistent class labels). Both weights may be used simultaneously (simhub) or separately (simhub\(^\mathrm{IN}\) and simhub\(^\mathrm{PUR}\) for informativeness and purity, respectively). Simhub is a supervised method when using purity weights. We thus restrict our evaluation to the unsupervised simhub\(^\mathrm{IN}\):
$$\begin{aligned} \text {simhub}^\text {IN}(x,y) = \frac{\sum _{z \in {{\mathrm{kNN}}}(x) \cap {{\mathrm{kNN}}}(y)} I_n(z)}{k \cdot \max I_n}, \qquad I_n(z) = \log \frac{n}{O^k(z)+1}, \quad \max I_n = \log n \end{aligned}$$
(10)
where \(I_n(z)\) is the occurrence informativeness of a shared neighbor \(\varvec{z}\) in a data set of size n. The neighborhood size k can be tuned in both SNN-based methods to minimize hubness. Computing \(1 - {{\mathrm{SNN}}}\) or \(1 - \text {simhub}\) turns the similarities into distances.
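A sketch of both shared-neighbor transforms (Eqs. 9 and 10) follows, returning the distances \(1 - {{\mathrm{SNN}}}\) and \(1 - \text {simhub}^\text {IN}\); the helper and function names are ours.

```python
import numpy as np

def _knn_membership(D, k):
    """Boolean matrix M with M[x, z] = True iff z is among the kNN of x."""
    n = D.shape[0]
    D = D.copy()
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]
    M = np.zeros((n, n), dtype=bool)
    M[np.arange(n)[:, None], nn] = True
    return M

def snn_distance(D, k=10):
    """1 - SNN (Eq. 9) from a primary distance matrix D."""
    M = _knn_membership(D, k).astype(float)
    D_snn = 1.0 - (M @ M.T) / k           # shared-neighbor counts, normalized by k
    np.fill_diagonal(D_snn, 0.0)
    return D_snn

def simhub_in_distance(D, k=10):
    """1 - simhub^IN (Eq. 10): shared neighbors weighted by informativeness."""
    n = D.shape[0]
    M = _knn_membership(D, k).astype(float)
    O_k = M.sum(axis=0)                   # k-occurrence of each object
    I_n = np.log(n / (O_k + 1.0))         # occurrence informativeness
    sim = ((M * I_n) @ M.T) / (k * np.log(n))
    D_sh = 1.0 - sim
    np.fill_diagonal(D_sh, 0.0)
    return D_sh
```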
Methods based on spatial centrality reduction and density gradient flattening
Centering approaches aim at reducing spatial centrality and use modified inner product similarities to span dissimilarity spaces. Global and local DisSim try to flatten density gradients and construct dissimilarities from squared Euclidean distances.
Centering and localized centering
Centering is a widely used preprocessing step that shifts vectors (\(\varvec{x}\)) so that the space origin coincides with the global centroid (\(\bar{c}\)). Centering dissimilarities can be calculated as
$$\begin{aligned} \text {CENT}(x,y) = - \langle x-\bar{c}, y-\bar{c} \rangle , \end{aligned}$$
(11)
where \(\langle \cdot , \cdot \rangle \) is the inner product of two vectors. The method was proposed for hubness reduction in the context of natural language processing [56]. Since centering moves the centroid to the origin, and inner product dissimilarities between any object and the origin (zero vector) are uniformly zero, centering effectively eliminates spatial centrality in inner product spaces, which should reduce hub emergence. Following this idea, localized centering (LCENT) was developed [27]: Instead of shifting the whole vector space, LCENT is a dissimilarity measure based on global affinity (the mean similarity between an object \(\varvec{x}\) and all other objects) and local affinity (the mean similarity between \(\varvec{x}\) and its k nearest neighbors):
$$\begin{aligned} {{\mathrm{LCENT}}}(x,y) = - \langle x, y \rangle + \langle x, c_k(x) \rangle ^{\gamma }, \end{aligned}$$
(12)
where \(c_k(x)\) denotes the local centroid among the k nearest neighbors of \(\varvec{x}\), \(\gamma \) is a parameter controlling the penalty introduced by the second term, and the leading negative sign indicates dissimilarities. LCENT dissimilarities are not guaranteed to be positive. Parameters \(\gamma \) and k can be tuned to minimize hubness.
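The following sketch illustrates both dissimilarities (Eqs. 11 and 12) on a data matrix X with one row per object. Determining neighbors by largest inner product and excluding the object itself from its local centroid follow one plausible reading of Eq. 12; the function names are ours.

```python
import numpy as np

def centering_dissimilarity(X):
    """CENT (Eq. 11): negative inner products after shifting the global
    centroid to the origin."""
    Xc = X - X.mean(axis=0)
    return -(Xc @ Xc.T)

def localized_centering_dissimilarity(X, k=10, gamma=1.0):
    """LCENT (Eq. 12): inner-product dissimilarity penalized by the local
    affinity <x, c_k(x)>^gamma (assumes non-negative affinities, e.g.,
    normalized vectors, so that the power is well-defined)."""
    S = X @ X.T                              # inner-product similarities
    S_nn = S.copy()
    np.fill_diagonal(S_nn, -np.inf)          # exclude self as neighbor
    nn = np.argsort(-S_nn, axis=1)[:, :k]    # k most similar objects
    c_k = X[nn].mean(axis=1)                 # local centroid per object
    local_aff = np.einsum('ij,ij->i', X, c_k)
    return -S + local_aff[:, None] ** gamma
```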
Global and local dissimilarity measures
The above-described centering approaches have no effect on Euclidean distances. As an alternative, two dissimilarity measures were introduced [26]: They reduce hubness by flattening the density gradient and thus eliminate spatial centrality in commonly used Euclidean spaces. The global variant DisSim\(^\mathrm{Global}\) (DSG) removes sample-wise centrality of two objects \(\varvec{x}\) and \(\varvec{y}\):
$$\begin{aligned} {{\mathrm{DSG}}}(x,y) = \Vert x - y \Vert ^2_2 - \Vert x - c\Vert ^2_2 - \Vert y - c\Vert ^2_2, \end{aligned}$$
(13)
where c is the global centroid and \(\Vert \cdot \Vert ^2_2\) indicates the squared Euclidean norm.
The local variant DisSim\(^\mathrm{Local}\) (DSL) is free from the assumption that all instances in the data set come from the same distribution: Instead of subtracting the global centroid, local centroids are estimated as \(c_k(x) = \frac{1}{k}\sum _{x' \in {{\mathrm{kNN}}}(x)} x'\), where \({{\mathrm{kNN}}}(x)\) is the set of k-nearest neighbors of \(\varvec{x}\), and substitution in Formula 13 yields:
$$\begin{aligned} {{\mathrm{DSL}}}(x,y) = \Vert x - y \Vert ^2_2 - \Vert x - c_k(x)\Vert ^2_2 - \Vert y - c_k(y)\Vert ^2_2. \end{aligned}$$
(14)
Parameter k can be tuned to minimize hubness.
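A sketch of both measures (Eqs. 13 and 14) for a data matrix X follows; the neighborhoods for the local centroids are taken in Euclidean space, and the function names are ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dis_sim_global(X):
    """DSG (Eq. 13): squared Euclidean distances minus the squared
    distances of both objects to the global centroid."""
    sq = squareform(pdist(X, 'sqeuclidean'))
    cent = ((X - X.mean(axis=0)) ** 2).sum(axis=1)    # ||x - c||^2 per object
    return sq - cent[:, None] - cent[None, :]

def dis_sim_local(X, k=10):
    """DSL (Eq. 14): as DSG, but with per-object local centroids c_k(x)."""
    sq = squareform(pdist(X, 'sqeuclidean'))
    D = sq.copy()
    np.fill_diagonal(D, np.inf)                       # exclude self
    nn = np.argsort(D, axis=1)[:, :k]                 # kNN per object
    c_k = X[nn].mean(axis=1)                          # local centroid
    cent = ((X - c_k) ** 2).sum(axis=1)               # ||x - c_k(x)||^2
    return sq - cent[:, None] - cent[None, :]
```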
Hubness-resistant dissimilarity measures
The methods described in this section try to avoid hubness by using alternative distance measures between data objects.
Choosing \(\ell ^{p}\) norms and the \(m_p\)-dissimilarity measure
Euclidean distances correspond to a special case of the family of \(\ell ^p\) norms (also known as Minkowski norms) with \(p=2\). The effect of using norms with \(p\ne 2\) in the context of hubness has been investigated previously [21]. An \(\ell ^p\) norm of a vector \((\varvec{x} - \varvec{y})\) can be interpreted as a dissimilarity between \(\varvec{x}\) and \(\varvec{y}\) and is calculated as follows:
$$\begin{aligned} d^{p}\left( x,y\right) = \left( \sum _{i=1}^m | x_i - y_i |^p\right) ^{1/p} \end{aligned}$$
(15)
For \(0<p<1\), the resulting fractional norms do not satisfy the triangle inequality in general; consequently, they do not constitute full distance metrics. The parameter p can be tuned to minimize hubness. In this work, we evaluate \(\ell ^p\) norms with \(p = 0.25, 0.5, \dots , 5\) (as in Ref. [21]) and ten values randomly selected from ]0, 5[.
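A minimal sketch of pairwise \(\ell ^p\) dissimilarities (Eq. 15) follows; it is valid for any \(p > 0\), including fractional values, but its memory use grows with \(n^2 m\), so it serves illustration rather than large data sets. The function name is ours.

```python
import numpy as np

def lp_dissimilarity_matrix(X, p=0.5):
    """Pairwise l^p dissimilarities (Eq. 15) for an (n, m) data matrix X."""
    diff = np.abs(X[:, None, :] - X[None, :, :])   # (n, n, m) pairwise differences
    return (diff ** p).sum(axis=-1) ** (1.0 / p)
```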
A data-dependent dissimilarity measure was recently derived from \(\ell ^{p}\) norms [2]. The \(m_p\)-dissimilarity takes into account data distributions by estimating the probability mass \(|R_i(x, y)|\) in a region R around \(\varvec{x}\) and \(\varvec{y}\) in each dimension i:
$$\begin{aligned} m_p(\varvec{x}, \varvec{y}) = \left( \frac{1}{m}\sum _{i=1}^m \left( \frac{|R_i(x, y)|}{n} \right) ^p\right) ^{1/p}, \quad |R_i\left( x, y\right) | = \sum _{q=l}^u |h_{i q}| \end{aligned}$$
(16)
That is, all objects are binned in each dimension. Let \(h_{il}\) and \(h_{iu}\) be the bins that contain \(\min (x_i, y_i)\) and \(\max (x_i, y_i)\), respectively. The probability mass \(|R_i|\) is then estimated by counting the objects in all bins from \(h_{il}\) to \(h_{iu}\), and it replaces the geometric distance used in \(\ell ^p\) norms. Dissimilarities are thus increased in dense regions and decreased in sparse regions.
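A sketch of the \(m_p\)-dissimilarity (Eq. 16) using equal-width histograms per dimension follows; the binning scheme, the number of bins, and the function name are illustrative choices and not prescribed by Ref. [2].

```python
import numpy as np

def mp_dissimilarity(X, p=2.0, n_bins=10):
    """m_p-dissimilarity (Eq. 16) with equal-width histograms per dimension."""
    n, m = X.shape
    out = np.zeros((n, n))
    for i in range(m):
        col = X[:, i]
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        b = np.digitize(col, edges[1:-1])              # bin index per object
        counts = np.bincount(b, minlength=n_bins)      # objects per bin
        csum = np.concatenate(([0], np.cumsum(counts)))
        lo = np.minimum(b[:, None], b[None, :])
        hi = np.maximum(b[:, None], b[None, :])
        R = csum[hi + 1] - csum[lo]                    # |R_i(x, y)|: mass between the two bins
        out += (R / n) ** p
    return (out / m) ** (1.0 / p)
```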
Time and space complexity
Hubness reduction is expensive due to the calculation of distances between all pairs of objects. Table 1 lists the time complexity of all methods. All methods applied to data vectors require \(\mathcal {O}(n^2 m)\) time (footnote 1). Since their prefactors differ considerably, timings for two synthetic data sets of increasing size and dimensionality are also provided. Methods applied to data distances require \(\mathcal {O}(n^2)\) or \(\mathcal {O}(n^3)\) time in addition to the \(\mathcal {O}(n^2 m)\) time for computing the primary distances. All methods require \(\mathcal {O}(n^2)\) space for returning the distance matrix. For primary distances in data sets with \(m > n\), this is dominated by the \(\mathcal {O}(n m)\) memory requirement of the input vectors. Intermediate steps typically require \(\mathcal {O}(n)\) space, except for the \(m_p\)-dissimilarity, which requires \(\mathcal {O}(b^2 m)\) space for distances between all pairs of bins in each dimension, where b is the number of bins and \(b \ll n\).
Table 1 Time complexity for computing distances between all pairs of objects, and timings for two synthetic data sets (\(B_{1000}\) with \(n=m=1000\) and \(B_{10\,000}\) with \(n=m=10\,000\)). Timings were performed with the Hub-Toolbox for Python (see Sect. 3.6) on a single core of an Intel Core i5-6500 CPU at 3.20 GHz with 15.6 GiB main memory
OFAI Hub-Toolbox
The availability of machine learning algorithms not only as formulas, but also as working code in reference implementations, facilitates reproducibility and applicability. Consequently, all methods described in this publication are available as part of a free open source software package for the Python programming environment. The Hub-Toolbox is easily installable from the PyPI package repository (footnote 2) and licensed under the GNU GPLv3. Please visit the GitHub page (footnote 3) for source code, development versions, issue tracking, and contribution possibilities. A MATLAB version of the Hub-Toolbox providing core functionality is also available on GitHub (footnote 4).