Active pairwise distance learning for efficient labeling of large datasets by human experts

In many machine learning applications, datasets are labeled by human experts, which is usually time-consuming for large datasets. This raises the need for methods that make optimal use of the human expert by selecting the instances for which the expert opinion is of most added value. This paper introduces the problem of active pairwise distance learning (APDL), where the goal is to actively learn the pairwise distances between all instances. Any distance function can be used, which means that APDL techniques can, e.g., be used to determine likeness between faces or similarities between users for recommender systems. Starting with an unlabeled dataset, each round an expert determines the distance between one pair of instances. Thus, there is an important choice to make each round: 'Which combination of instances is presented to the expert?' The objective is to accurately predict all pairwise distances, while minimizing the usage of the expert. In this research, we establish upper and lower bound approximations (including an update rule) for the pairwise distances and evaluate many domain-independent query strategies. The observations from the experiments are therefore general, and the selection strategies are ideal candidates to serve as baselines in future research. We show that the criterion max degree consistently ranks amongst the best strategies. By using this criterion, the pairwise distances of a new dataset can be labeled much more efficiently.


Introduction
A dataset plays a critical part when solving a practical problem using machine learning (ML). Often, the goal is to predict some target variable using measured features of other variables. When gathering the data, it would be ideal if the target variable could be measured. For example, consider the task of forecasting the outside temperature using multiple other measurements, such as atmospheric pressure, wind speed and humidity. In this case, the label (temperature) can be determined efficiently. In other cases, the labels are not as easily acquired. For example, predicting whether a face is visible in a photograph requires human expertise at some point to label a dataset. In such cases, human involvement is sometimes necessary, especially when a model is trained to replicate human knowledge or skills.
Labeling using a human expert is a time-consuming and costly undertaking. Therefore, efforts should be focused on maximizing the usefulness of the expert when it is too expensive to label everything. Typical questions are: 'How should the expert be deployed?' and 'Which samples should be labeled?' These questions are all part of the research field called active learning (AL) [1]. It is a subfield of ML dedicated to achieving the best prediction performance with as few labels as possible. To this end, a human expert can be queried about an instance each round. The expert then determines a label for this instance, which in turn can be used to update a prediction model and determine the next query. This cycle continues for a fixed number of rounds or until some other stopping criterion is met [2][3][4].
AL is useful in situations where simply labeling all data instances is too expensive. For example, suppose we want to label a dataset with many facial images and we are interested in learning the similarity/likeness between each combination of faces. If there are $M \in \mathbb{N}_{>0}$ faces, then there are already $\binom{M}{2} = M \cdot (M - 1)/2$ pairwise combinations. To label all pairwise similarities of 1,000 faces would thus already require 499,500 comparisons. For large datasets, this quickly becomes too costly to label (in either time or money), which is why AL techniques have been developed. A critical aspect in AL is the selection algorithm (the so-called query function) that determines which samples should be given to the expert. The selection algorithm can be either pre-trained using other datasets (transfer learning [5,6]), or it can be adjusted on-the-fly. AL techniques (almost always) use feature values to improve the query function, which is commonly some supervised learning method (e.g., a neural network). Yoo et al. [7] attached a module to a target network to predict the target losses for unlabeled data. Klein et al. [8] measured anomaly scores of feature values as guidance for the query function. Another common selection criterion is some kind of uncertainty sampling [9], whereby a prediction model is trained using the labeled data, and applied on the feature values of the unlabeled data. Uncertain predictions are then queried to the expert.
In this paper, we investigate an unexplored area within AL that we call active pairwise distance learning (APDL). The objective in APDL is to actively learn the pairwise distances between all instances. Any distance function can be used, which means that APDL techniques can, e.g., be used to determine likeness between faces or similarities between users for recommender systems. Furthermore, APDL methods can also be used in kinship recognition, deep fake detection, anomaly detection, dissimilarity sampling, and (pairwise) clustering. Studying APDL is therefore valuable for many research areas. It is important to emphasize that we will not make any assumptions in this research about the relevance of the feature values to these distances (see Section 2.2 for more details), which makes our results highly generic and hence useful in many application areas.
The contribution of this research is three-fold. First, we introduce APDL, the problem of actively learning the pairwise distances between all instances. Second, we establish upper and lower bound approximations for the pairwise distances, and an update rule for these bounds. Third, we identify the best generic (domain-independent) baseline strategies for practical applications. This research can be seen as a pioneering contribution to the field of AL, which we expect to prompt many follow-up studies.
The remainder of this paper is organized as follows. In Section 2, we formally introduce APDL and discuss why no assumptions are made about the feature values. Consequently, we argue that techniques from unsupervised learning, semi-supervised learning and reinforcement learning are not applicable without these assumptions. Related research is discussed in Section 3. Section 4 defines notation for the selection strategies. Furthermore, it is discussed how each additional pairwise distance will update the upper and lower approximation bounds for all pairwise distances. A variety of selection strategies and selection criteria are defined in Section 5. Next, the experimental setup is addressed in Section 6. The experiments evaluate the selection strategies on multiple datasets to find the best performing strategy. The results of the experiments are discussed in Section 7. Section 9 gives an extensive overview of possible future research opportunities and addresses limitations of the results presented in this paper. Finally, Section 10 summarizes the findings.

Definition of APDL
To start, we formally define active pairwise distance learning (APDL). Starting with an unlabeled dataset consisting of $M$ instances, the objective of APDL is to learn as much as possible about the distance between each pair of instances in $T \in \mathbb{N}_{>0}$ rounds. Each round, an expert can be queried to label exactly one pairwise distance. After $T$ rounds, a final prediction is made about all pairwise distances. Given a pre-determined loss function $L$, the goal is to minimize the loss between the actual pairwise distance matrix $D^{\mathrm{true}}$ and the predicted pairwise distance matrix $D^{\mathrm{pred}}$. Thus, the target of any APDL algorithm is to minimize $L(D^{\mathrm{pred}}, D^{\mathrm{true}})$. In general, there are two critical components in APDL: (I) 'Which pair is queried each round?' and (II) 'How to use this information to make the best prediction?' The first question is the main focus of this research. The general approach of an APDL algorithm can be seen in Algorithm 1.
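The query-then-predict cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function names (`apdl_loop`, `random_unqueried_pair`, `predict_known_only`), the 0-based indexing, the constant default prediction and the toy one-dimensional data are all hypothetical choices made for the example.

```python
import random


def apdl_loop(num_instances, num_rounds, expert, select_pair, predict):
    """Generic APDL loop: query one pair per round, then predict all distances."""
    history = []  # list of ((i, j), distance) in query order
    for _ in range(num_rounds):
        i, j = select_pair(num_instances, history)  # component (I): which pair?
        history.append(((i, j), expert(i, j)))      # the expert labels one pair
    return predict(num_instances, history)          # component (II): prediction


def random_unqueried_pair(num_instances, history):
    """Placeholder strategy: uniform choice among pairs not labeled yet."""
    asked = {frozenset(pair) for pair, _ in history}
    remaining = [(i, j)
                 for i in range(num_instances)
                 for j in range(i + 1, num_instances)
                 if frozenset((i, j)) not in asked]
    return random.choice(remaining)


def predict_known_only(num_instances, history, default=0.5):
    """Placeholder predictor: labeled distances where known, a constant elsewhere."""
    pred = [[0.0 if i == j else default for j in range(num_instances)]
            for i in range(num_instances)]
    for (i, j), dist in history:
        pred[i][j] = pred[j][i] = dist
    return pred


points = [0.0, 1.0, 3.0, 6.0]  # toy 1-D data, an assumption for illustration


def expert(i, j):
    """Error-free expert returning the true distance between two toy points."""
    return abs(points[i] - points[j])


random.seed(0)
d_pred = apdl_loop(len(points), 3, expert, random_unqueried_pair, predict_known_only)
```

Passing the strategy and the predictor as arguments keeps the two components (I) and (II) separate, so either can be swapped without touching the loop.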

No relevancy assumption
A key choice in this research is that no assumptions are made about the relevance of the feature values to the actual distance. As a consequence, only techniques that do not use the feature values are considered. Note that having similar feature values does not necessarily mean that the underlying distance between two instances is small. Insufficient features could mean that instances appear close, but are actually far apart. Having too many features could also be troublesome for measuring similarity, as instances in a high-dimensional space are often far apart (due to the infamous curse of dimensionality). Furthermore, sufficient labeled data is required to accurately extract information from the feature values in order to make good predictions. Especially for high-dimensional data and complex prediction models, more labeled data is necessary to properly train the prediction model. Gal et al. [10] even identified the lack of scalability to high-dimensional data as one of the major remaining challenges for AL. However, in practice sufficient labeled data is not always available. In addition, a recent survey [11] stated that "research remains in its infancy at present, and there is still a long way to go in the future." A badly trained prediction model could steer the query selection in the wrong direction.
Making no assumption about the relevancy of the feature values to the pairwise distance renders most known techniques from unsupervised, semi-supervised and active learning inappropriate. Chapelle et al. [12] identify in which cases semi-supervised learning is suitable. They determine the following three assumptions in order to apply semi-supervised learning techniques:
Smoothness assumption: "If two points x 1 , x 2 in a high-density region are close, then so should be the corresponding outputs y 1 , y 2 ."
Cluster assumption: "If points are in the same cluster, they are likely to be of the same class."
Manifold assumption: "The (high-dimensional) data lie (roughly) on a low-dimensional manifold."
The smoothness and cluster assumptions do not have to hold when the underlying distance metric (responsible for the actual labels) is very different from the metric that is used to measure if two points are close and if they belong to the same cluster. Consider for example determining if cars are similar using images. If the distance between two images is measured by comparing them pixel-by-pixel, it is highly likely that only the color of the car (or even the background) determines if two cars are similar. Therefore, this is not a good approach.
The manifold assumption is important to combat the well-known curse of dimensionality problem. Without this assumption, a lot of data is necessary to learn the underlying distribution from the feature values. In such a situation, it might be better to make no assumptions than to be steered in the wrong direction due to a lack of labeled data.
Techniques from reinforcement learning [13] have similar problems when feature values are used. Given a specific dataset, the same action (i.e., querying the expert about a certain pair) is not repeated. Furthermore, no state is revisited and the state space can be really large. Thus, some mapping must be learned from the feature values. This inherently has the same assumption problems as discussed before.
When not to make relevancy assumption. We identify six situations where it could be useful to make no assumptions about the relevancy of the feature values to the pairwise distance: (I) when there is not yet enough labeled data for supervised techniques; (II) when the underlying metric is unknown and could be too complex to predict using the given features; (III) when the features are not sufficient; (IV) when there are too many features; (V) when the model should work across multiple domains; (VI) as baseline to evaluate techniques that do use feature values.
To elaborate on situation (VI), whenever for example a semi-supervised technique is developed, it should perform better than any method that does not use the feature values. Therefore, not using the feature values can be used to benchmark methods that do use feature values.
Advantages of not using feature values. Not using feature values has its benefits. We list five advantages: (I) the dimensionality of data is irrelevant; (II) the quality of feature values is unimportant; (III) no hyper-parameter tuning based on feature values is needed; (IV) conclusions are not dependent on the application domain; (V) resulting baselines are ideal to be used as benchmark. As this research constitutes the first step in APDL, these are the reasons why we decide to only investigate selection strategies that do not use feature values.

Related research
To the best of our knowledge, APDL is a new research area within AL. However, there are related papers, which we will outline below.
APDL is not the same as learning pairwise preferences [14], where the goal is to make a ranking based on pairwise comparisons. In these pairwise comparisons, it is decided which sample is preferred, which is a binary choice. A might be preferred over B, but it is not labeled by how much, which is an important distinction. Furthermore, the focus lies more on determining a good ranking function, not necessarily determining which samples should be labeled in order to gain the most information. However, it is closely related and (non-binary) preference / desirability could also be used as a distance metric within APDL.
Dasarathy et al. [15] investigate binary label prediction on a graph. A non-parametric algorithm is developed to actively learn to predict binary labels in a graph. The objective for APDL is to learn all pairwise distances, thus the graph would be fully connected. The main difference with our research is that binary labels are assumed in [15], whereas we assume that the labels are generated by a distance metric. On the one hand, it makes the problem easier, as structure is added to the labels, because properties of a distance metric need to be satisfied. On the other hand, a label can now be real-valued and not only binary, which makes prediction much harder.
Actively learning pairwise similarities has also been studied for hierarchical clusterings [16]. The goal is to infer the hierarchical clustering using as few similarities as possible. These similarities are not necessarily from a distance metric, as e.g., the Pearson correlation is used in [16]. The performance is assessed by evaluating the constructed tree structures. This makes APDL different, as the objective is to predict all pairwise distances, not to identify the correct tree structure.
APDL is also closely related to similarity learning and metric learning [17][18][19]. These are supervised ML areas, where the goal is to learn from a labeled dataset a similarity function and a metric, respectively. The task of face verification is a practical example of these research areas. In [20], the triplet loss is used to learn a distance function from 0/1-labels to compare faces. The main difference with APDL is that similarity and metric learning require a labeled dataset in order to determine a generalized function that can be used for new samples. The objective in APDL is to gather as much distance-based information as possible about a fixed dataset, when there is not yet any information about the labels. APDL is thus not concerned with finding a general function for samples outside the given dataset. APDL could be used to build the dataset that is later used by techniques from similarity learning and metric learning.
Metric learning has also been researched in an AL setting. Yang et al. [21] developed a Bayesian framework to actively learn a distance metric by selecting the unlabeled pairs with the greatest uncertainty in predicting whether the pair is in the same equivalence class or not. Kumaran et al. [22] actively learned a distance metric to identify outlier and boundary points per class, which are then given to the expert. Even more selection strategies are explored in [23]. Pasolli et al. [24] used an actively learned metric to reduce the dimensionality of hyperspectral images and to select uncertain samples. Again, the goal in active metric learning is to get a model to accurately predict if two samples belong to the same class, not to determine an accurate prediction for pairwise distances. This makes APDL a fundamentally different problem.

Definitions and bounds
First, we introduce some notation that is necessary to discuss selection strategies. As seen in Algorithm 1, in round $t$ a pair of indices $\zeta_t := (i, j)$ is chosen from $M$ indices and a corresponding distance $d(i, j)$ between these indices is obtained from the expert. Although it is possible to disregard previous requests to the expert, it is obvious that previous results should be taken into account when selecting the next pair of indices, if only to avoid asking the expert about the same pair twice. Therefore, we introduce the notion of history.

Definition 1 (History) Name $H_t := \big(\zeta_\tau, d(\zeta_\tau)\big)_{\tau = 1, \ldots, t}$ the history of all chosen pairs of indices and their corresponding labeled distance up to and including round $t$. Furthermore, define $H_0 := \emptyset \times \emptyset$.
Next, we will define what a selection strategy is. A selection strategy for $T$ rounds consists of $T$ functions that successively determine which pair of indices is chosen based on the given history.

Expert distance metric
After the selection strategy determines which pair of indices is chosen, the expert determines the distance between them. An important and strong assumption we make is that the expert makes no mistakes and that the distances originate from an underlying metric $d : \{1, \ldots, M\}^2 \to [0, d_{\max}]$, where $d_{\max} \in \mathbb{R}_{>0}$ is the maximum possible distance between two samples. In most instances, $d_{\max}$ can be estimated or determined. However, when the maximum distance cannot be bounded from above, consider $d_{\max}$ to be infinite. In our experiments, the underlying distance metric is the Euclidean distance between two samples and the expert simply returns the correct Euclidean distance.

Approximation bounds
To approximate the true distance between each pair of indices, we can make use of the fact that the underlying distance function $d$ is a metric, to find upper and lower bounds. Each metric satisfies, by definition, the triangle inequality and the subsequent reverse triangle inequality. Denote the upper and lower bound of $(i, j)$ in round $t$ as $D^{\mathrm{upp}}_t(i, j)$ and $D^{\mathrm{low}}_t(i, j)$, respectively. The metric $d$ is symmetric (i.e., $d(x, y) = d(y, x)$), thus we enforce the upper and lower bounds to be symmetric as well. Therefore, it must always hold that $D^{\mathrm{upp}}_t(i, j) = D^{\mathrm{upp}}_t(j, i)$ and $D^{\mathrm{low}}_t(i, j) = D^{\mathrm{low}}_t(j, i)$. We will now discuss how triangle inequalities can be used to update the upper and lower bounds each time a new distance is obtained from the expert.

Initialization
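In the first round, there is no distance information yet. However, as $d$ is a metric, it must hold that $d(i, i) = 0$ for each $i \in \{1, \ldots, M\}$. Furthermore, using the range of $d$, the upper and lower bounds are initialized as:

$$D^{\mathrm{upp}}_1(i, j) := \begin{cases} 0 & \text{if } i = j, \\ d_{\max} & \text{otherwise,} \end{cases} \qquad D^{\mathrm{low}}_1(i, j) := 0.$$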

Triangle inequality
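The triangle inequality states that for all $a, b, c \in \{1, \ldots, M\}$ it must hold that $d(a, c) \leq d(a, b) + d(b, c)$. Expanding on this, for every round $t$ it follows that

$$d(a, c) \leq d(a, b) + d(b, c) \leq D^{\mathrm{upp}}_t(a, b) + D^{\mathrm{upp}}_t(b, c). \tag{1}$$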

Reverse triangle inequality
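The reverse triangle inequality states that for all $a, b, c \in \{1, \ldots, M\}$ it must hold that $d(a, c) \geq |d(a, b) - d(b, c)|$. For every round $t$ it follows that $d(a, c) \geq d(a, b) - d(b, c) \geq D^{\mathrm{low}}_t(a, b) - D^{\mathrm{upp}}_t(b, c)$. Therefore, this gives a lower bound for $(a, c)$. Thus,

$$d(a, c) \geq \max\left\{ D^{\mathrm{low}}_t(a, b) - D^{\mathrm{upp}}_t(b, c),\; D^{\mathrm{low}}_t(b, c) - D^{\mathrm{upp}}_t(a, b) \right\}. \tag{2}$$

Update rules. In round $t$, we first set $D^{\mathrm{upp}}_{t+1} := D^{\mathrm{upp}}_t$ and $D^{\mathrm{low}}_{t+1} := D^{\mathrm{low}}_t$. After the new distance $d(i, j)$ is given by the expert, the upper and lower bound collapse to $d(i, j)$, as it is assumed that the expert makes no mistakes. Thus,

$$D^{\mathrm{upp}}_{t+1}(i, j) = D^{\mathrm{low}}_{t+1}(i, j) = d(i, j). \tag{U1}$$

This newly acquired information can have an effect on other bounds as well. For all $k \in \{1, \ldots, M\}$, (1) now gives the following update rules:

$$\begin{aligned} D^{\mathrm{upp}}_{t+1}(i, k) &= \min\left\{ D^{\mathrm{upp}}_{t+1}(i, k),\; D^{\mathrm{upp}}_{t+1}(i, j) + D^{\mathrm{upp}}_{t+1}(j, k) \right\}, \\ D^{\mathrm{upp}}_{t+1}(j, k) &= \min\left\{ D^{\mathrm{upp}}_{t+1}(j, k),\; D^{\mathrm{upp}}_{t+1}(j, i) + D^{\mathrm{upp}}_{t+1}(i, k) \right\}. \end{aligned} \tag{U2}$$

Note that this can lead to multiple updates, as $D^{\mathrm{upp}}_{t+1}(i, k)$ is updated in the first line and used in the second, whereas $D^{\mathrm{upp}}_{t+1}(j, k)$ is used in the first and updated in the second. For each bound that is now tighter than before, the same procedure should be repeated. Note that the order of the updates does not influence the end result as long as the effect of every tighter bound is evaluated.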
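Thereafter, lower bounds can be updated using (2). For all $k \in \{1, \ldots, M\}$, the updates are as follows:

$$\begin{aligned} D^{\mathrm{low}}_{t+1}(i, k) &= \max\left\{ D^{\mathrm{low}}_{t+1}(i, k),\; D^{\mathrm{low}}_{t+1}(i, j) - D^{\mathrm{upp}}_{t+1}(j, k),\; D^{\mathrm{low}}_{t+1}(j, k) - D^{\mathrm{upp}}_{t+1}(i, j) \right\}, \\ D^{\mathrm{low}}_{t+1}(j, k) &= \max\left\{ D^{\mathrm{low}}_{t+1}(j, k),\; D^{\mathrm{low}}_{t+1}(j, i) - D^{\mathrm{upp}}_{t+1}(i, k),\; D^{\mathrm{low}}_{t+1}(i, k) - D^{\mathrm{upp}}_{t+1}(j, i) \right\}. \end{aligned} \tag{U3}$$

Again, this can lead to multiple updates, similar to the upper bound updates. However, it is important to note that a new upper bound can lead to a new lower bound, but not vice versa. When an upper bound changes (e.g., $D^{\mathrm{upp}}_{t+1}(x, y)$), Update rules (U2) and (U3) should be evaluated (replacing $(i, j)$ with $(x, y)$). Whenever a lower bound changes (e.g., $D^{\mathrm{low}}_{t+1}(x, y)$), only Update rule (U3) needs to be checked. The entire update procedure is summarized in Algorithm 2, which should be applied each time a new distance label is obtained from the expert.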
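The initialization and the propagation of tighter bounds can be sketched as follows. This is an illustrative sketch, not a reproduction of Algorithm 2: the names `init_bounds` and `add_label`, the 0-based indices and the work-queue formulation are assumptions made for the example. For simplicity, every pair whose bound changed is re-checked against both the triangle and the reverse triangle inequality, which is slightly more work than strictly necessary but reaches the same fixed point.

```python
import math
from collections import deque


def init_bounds(m, d_max=math.inf):
    """Initial bounds: zero on the diagonal, [0, d_max] for all other pairs."""
    upp = [[0.0 if i == j else d_max for j in range(m)] for i in range(m)]
    low = [[0.0] * m for _ in range(m)]
    return upp, low


def add_label(upp, low, i, j, dist):
    """Insert the exact distance d(i, j) and propagate all tighter bounds."""
    m = len(upp)
    # The expert is assumed to be error-free, so both bounds collapse.
    upp[i][j] = upp[j][i] = low[i][j] = low[j][i] = dist
    queue = deque([(i, j)])  # pairs whose bounds changed and must be propagated
    while queue:
        x, y = queue.popleft()
        for a, b in ((x, y), (y, x)):  # use the pair in both orientations
            for k in range(m):
                if k == a or k == b:
                    continue
                # Triangle inequality: d(a, k) <= d(a, b) + d(b, k).
                cand = upp[a][b] + upp[b][k]
                if cand < upp[a][k]:
                    upp[a][k] = upp[k][a] = cand
                    queue.append((a, k))
                # Reverse triangle inequality, applied in both directions.
                cand = max(low[a][b] - upp[b][k], low[b][k] - upp[a][b])
                if cand > low[a][k]:
                    low[a][k] = low[k][a] = cand
                    queue.append((a, k))


# Toy example: three points on a line at positions 0, 1 and 3.
upp, low = init_bounds(3)
add_label(upp, low, 0, 1, 1.0)
add_label(upp, low, 1, 2, 2.0)
bounds_02 = (low[0][2], upp[0][2])  # (1.0, 3.0); the true distance is 3.0
```

In the example, the unlabeled pair (0, 2) ends up with the interval [1, 3]: the upper bound comes from adding the two labeled distances, the lower bound from their difference.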

Strategies
In this section, we discuss the selection strategies that will be evaluated. As the APDL problem is new, we will investigate relatively straightforward strategies based on naturally arising criteria to determine the baseline strategies for future research. Without previous literature, there is yet no evidence which strategies should perform well. However, we can argue, e.g., that selecting indices where the upper and lower bound are already close is not a good idea. Thus, sometimes we investigate a strategy that maximizes a criterion, without looking into a strategy that minimizes the same criterion, or vice versa. On top of the general definition of a strategy (see Definition 2), it is necessary to introduce some concepts and definitions that are used by certain selection strategies. A selection strategy $\sigma$ consists of functions $\sigma_t$ for $t \in \{1, \ldots, T\}$ (see Definition 2). For all strategies that will be used, it holds that the same selection criterion is used for each $\sigma_t$. In other words, the strategy does not change for different rounds.
It is possible that multiple samples satisfy some selection criterion (for example, the least chosen strategy). If more than one sample is optimal for the selection criterion, a selection between these samples is made uniformly at random. The following notation is used for this.
Definition 3 (Drawn uniformly from set) Let $U(A)$ denote the uniform distribution over a finite non-empty set $A$. Thus, when $X \sim U(A)$ it must hold that $P(X = a) = \frac{1}{|A|}$ for each $a \in A$.

Degree
It is also useful to track how often each index is chosen. Note that the problem can be visualized by a graph. Each sample is a vertex, and an edge is drawn between a pair of vertices whenever the expert labels the distance between this pair. How often each index is chosen is identical to the degree (from graph theory) of the corresponding vertex. Let $\deg_t(k)$ denote the degree of sample $k$ in round $t$. This can be determined by

$$\deg_t(k) := \sum_{\tau = 1}^{t - 1} \mathbb{1}\left[ k \in \zeta_\tau \right],$$

where $k \in \zeta_\tau$ means that $k$ is one of the two indices of the pair $\zeta_\tau$.
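As a small illustration of the graph view, the degree of every sample can be counted directly from the list of queried pairs. The function name `degrees` and the 0-based indices are assumptions for this sketch.

```python
def degrees(num_instances, queried_pairs):
    """Count how often each index occurs in the pairs labeled so far."""
    deg = [0] * num_instances
    for i, j in queried_pairs:
        deg[i] += 1  # each labeled pair is an edge touching both endpoints
        deg[j] += 1
    return deg


# Sample 1 was part of all three queries, the others of one query each.
deg = degrees(4, [(0, 1), (1, 2), (1, 3)])  # [1, 3, 1, 1]
```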

Predicted distance
Let $D^{\mathrm{pred}}_t(i, j)$ be the predicted distance between samples $i$ and $j$ in round $t$. We will later show (in Definition 4 below) how the distance is actually predicted. Strategies can use these predictions in a selection criterion.

Different kinds of strategies
Next, we divide the selection strategies into two groups, namely simultaneous and sequential strategies. Behind a simultaneous strategy, there is a singular selection criterion that determines which pair of indices is selected in round $t$ out of all possible remaining pairs, that is, the pairs $(i, j)$ with $i \neq j$ such that $(i, j) \neq \zeta_\tau$ and $(j, i) \neq \zeta_\tau$ for all $\tau \in \{1, \ldots, t - 1\}$.
For a sequential strategy, the indices are chosen one after the other by two (possibly different) selection criteria. To this end, if $\sigma_t(H_{t-1}) = (i, j)$, let $\sigma_t(H_{t-1})_1 := i$ and let $\sigma_t(H_{t-1})_2 := j$ denote the first and second index, respectively. $\sigma_t(H_{t-1})_1$ is chosen from the remaining first indices, thus from the indices that still occur in at least one remaining pair. Whenever the first index is chosen, the remaining second indices reduce, as they are limited by the first chosen index $\sigma_t(H_{t-1})_1$. The second index is chosen from the indices that still form a remaining pair with $\sigma_t(H_{t-1})_1$.

Simultaneous strategies
First, we will discuss the simultaneous strategies, where both indices are chosen at the same time.

Random pair
Select a pair uniformly at random out of the remaining pairs.

Max bound gap
Select a pair uniformly at random out of the remaining pairs with the largest difference between the upper and lower bound of the predicted distance.

Max/min total degree
First, determine for each sample the degree, see Section 5.
Then, select a pair uniformly at random out of all remaining pairs where the sum of the individual degrees is maximized/minimized.
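The simultaneous strategies share one pattern: score every remaining pair, then draw uniformly among the best-scoring pairs. The sketch below illustrates this pattern under stated assumptions; the names `remaining_pairs` and `select_simultaneous`, the 0-based indices and the toy bound matrices are hypothetical, and ties are detected by exact equality of scores.

```python
import random


def remaining_pairs(m, queried):
    """All unordered pairs that have not been labeled by the expert yet."""
    asked = {frozenset(p) for p in queried}
    return [(i, j) for i in range(m) for j in range(i + 1, m)
            if frozenset((i, j)) not in asked]


def select_simultaneous(m, queried, score, rng=random):
    """Draw uniformly among the remaining pairs that maximize score(i, j)."""
    pairs = remaining_pairs(m, queried)
    best = max(score(i, j) for i, j in pairs)
    return rng.choice([p for p in pairs if score(*p) == best])


# Toy bounds for three samples after one query of the pair (0, 1).
queried = [(0, 1)]
upp = [[0, 1, 3], [1, 0, 5], [3, 5, 0]]
low = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
deg = [1, 1, 0]

# Max bound gap: largest difference between upper and lower bound.
gap_pair = select_simultaneous(3, queried, lambda i, j: upp[i][j] - low[i][j])
# Min total degree: maximize the negated sum of the individual degrees.
deg_pair = select_simultaneous(3, queried, lambda i, j: -(deg[i] + deg[j]))
```

Here `gap_pair` is (1, 2), whose gap of 5 exceeds the gap of 2 for (0, 2). Both remaining pairs have total degree 1, so `deg_pair` is drawn at random between them. Random pair corresponds to a constant score.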

Sequential strategies
Next, we will discuss the sequential strategies, where the second index is chosen after the first.

Random index
Criterion 6 (Random index) Draw uniformly at random an index out of the unique set of possible remaining indices.
Note that choosing the first and second index using random index is not equivalent to using the random pair strategy, as random index uses the unique indices, whereas random pair does not.

Linked
This strategy can only be applied for the first index. Use the second index of the previous round as the first index of this round, unless there are no remaining pairs with this index. In this case and in the first round, choose the first index uniformly at random from the unique first indices, equivalent to the random index strategy, see (8).

Max/min degree
Choose uniformly at random an index with maximum/minimum degree (see Section 5) out of the unique set of possible remaining indices, that is, an index $j$ that maximizes/minimizes $\deg_t(j)$ over this set (14).

Max total bound gap
First, determine for each sample the bound gap with all other samples and sum these into a combined bound gap. Then, choose uniformly at random an index with maximum combined bound gap.

Max previous expected distance
In the first round, this strategy simplifies to the random index strategy (Section 5.2.1). Thereafter, choose uniformly at random an index out of the unique set of the possible remaining indices, such that the predicted distance to the indices of the previous round is maximized.

Max/min/median expected distance
This strategy can only be applied for the second index. Select uniformly at random an index out of the unique set of remaining possible indices that belong to the maximum/minimum/median of the predicted distance (see Section 6.4) to the first index.
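A sequential strategy combines one criterion for the first index with a second criterion for the partner index. The following sketch illustrates that composition; the function name `select_sequential`, the 0-based indices, and the toy degrees and predicted distances are assumptions made for the example, and only maximizing criteria are shown (a minimizing criterion can be obtained by negating the score).

```python
import random


def select_sequential(m, queried, first_score, second_score, rng=random):
    """Choose the first index by one criterion, then its partner by another."""
    asked = {frozenset(p) for p in queried}
    # For every index, the partners with which it still forms an unlabeled pair.
    partners = {i: [j for j in range(m)
                    if j != i and frozenset((i, j)) not in asked]
                for i in range(m)}
    firsts = [i for i in range(m) if partners[i]]  # remaining first indices
    best = max(first_score(i) for i in firsts)
    i = rng.choice([k for k in firsts if first_score(k) == best])
    best = max(second_score(i, j) for j in partners[i])
    j = rng.choice([k for k in partners[i] if second_score(i, k) == best])
    return i, j


# Toy state: five samples, two labeled pairs, hypothetical predicted distances.
queried = [(0, 1), (0, 2)]
deg = [2, 1, 1, 0, 0]
pred = [[float(abs(i - j)) for j in range(5)] for i in range(5)]

# First index by max degree, second index by max expected distance.
pair = select_sequential(5, queried,
                         first_score=lambda i: deg[i],
                         second_score=lambda i, j: pred[i][j])
```

In this toy state the result is (0, 4): sample 0 has the highest degree, and among its unlabeled partners 3 and 4 the predicted distance to sample 4 is largest.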

Strategies
The goal of the experiments is to find which strategies perform well for which dataset. In Section 5, all used criteria are explained and defined. With simultaneous strategies, an index pair $(i, j)$ is chosen at once. With sequential strategies, a separate decision is made for the first and second index sequentially. For example, one strategy uses Criterion 8 (max degree) to select the first index, and Criterion 9 (min degree) for the second index. In total, this leads to 5 (simultaneous) + 6 · 8 (sequential) = 53 different strategies (see Table 1). Furthermore, all strategies are stochastic. Therefore, each strategy is repeated ten times for each dataset. Thereafter, results are averaged to reduce stochastic outliers. It is desirable that a strategy performs generally well, not only coincidentally.
Observe that these datasets are all two-dimensional. In other words, they have two features. Note that this is not a shortcoming for this experiment, as it is assumed that features are not relevant for the APDL techniques (see Section 2.2 above). As long as the calculated pairwise distances remain the same, these datasets could have any dimension. Two-dimensional datasets were chosen because they can be visualized easily.

Number of rounds
The number of samples $M$ is dependent on the dataset. Especially for increasingly large datasets, it is undesirable to keep on labeling until all labels are given. Namely, $\binom{M}{2} = M \cdot (M - 1)/2$ pairwise combinations can be made in total. If, e.g., ten percent of the combinations should be labeled, the total number of rounds $T$ grows quadratically in the number of samples. This gives many more opportunities to determine good upper and lower bound approximations for a large dataset compared to a small dataset. Therefore, we decide to choose the total number of rounds for a dataset in a linear-growing fashion. A minimum spanning tree (MST) in graph theory is a subset of edges in an undirected graph, such that all vertices are connected without any cycles (and with minimal total edge weight). In total, $M - 1$ edges are necessary to make an MST for a graph with $M$ vertices. For each $M - 1$ labels given by the expert, a spanning tree could have been formed. Now, let $M_{\mathrm{MST}} := M - 1$ and define the total number of rounds $T$ as $10 \cdot M_{\mathrm{MST}}$. This reflects a scenario where it is not possible to determine many labels, which will often be the case in practice.
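To make the budget concrete, the following sketch compares the linear round budget with the total number of pairs. The function name `round_budget` and the example size of 1,000 samples are assumptions for illustration. The labeled fraction simplifies to 20/M, so the budget covers an ever smaller share of the pairs as the dataset grows.

```python
from math import comb


def round_budget(m, factor=10):
    """Return (rounds, total pairs, labeled fraction) for m samples."""
    total_pairs = comb(m, 2)     # M * (M - 1) / 2 pairwise combinations
    rounds = factor * (m - 1)    # T = 10 * M_MST with M_MST = M - 1
    return rounds, total_pairs, rounds / total_pairs


rounds, total_pairs, fraction = round_budget(1000)
# 9,990 rounds out of 499,500 pairs, i.e. 2 percent of all pairs are labeled.
```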

Performance evaluation of strategies
In order to compare the different strategies, it is important to discuss how the performance of the strategies is evaluated. Each strategy is applied ten times on each dataset. Each round a prediction is made by averaging the upper and lower bound.
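Definition 4 (Predicted distance matrix) Let $D^{\mathrm{pred}}_t$ be the predicted distance matrix in round $t$, such that $D^{\mathrm{pred}}_t(i, j) := \frac{D^{\mathrm{upp}}_t(i, j) + D^{\mathrm{low}}_t(i, j)}{2}$. Note that if $(i, j)$ was labeled by the expert, it holds that $D^{\mathrm{pred}}_t(i, j) = d(i, j)$.
Definition 5 (True distance matrix) Let $D^{\mathrm{true}}$ be the true distance matrix.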
The prediction error between the predicted distance matrix $D^{\mathrm{pred}}_t$ and the true distance matrix $D^{\mathrm{true}}$ can now be calculated. To compare these two matrices, the mean squared error is used. This leads to the following definition.
Definition 6 (Prediction error) The error $\varepsilon_t$ in round $t$ is determined as

$$\varepsilon_t := \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \left( D^{\mathrm{pred}}_t(i, j) - D^{\mathrm{true}}(i, j) \right)^2.$$

After collecting all prediction error results, three approaches are undertaken to compare the performance of each strategy: (I) average performance; (II) Borda count; (III) area under the curve (AUC). Each approach will now be explained.
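The midpoint prediction and the mean squared error can be sketched as follows. The function names are hypothetical, and two points are assumptions of this sketch rather than statements from the text: the error is averaged over all M² matrix entries (including the diagonal), and all upper bounds are finite.

```python
def predict_midpoint(upp, low):
    """Predicted distance: average of the upper and lower bound per pair."""
    m = len(upp)
    return [[(upp[i][j] + low[i][j]) / 2 for j in range(m)] for i in range(m)]


def mean_squared_error(pred, true):
    """Mean squared difference over all entries (assumed normalization: M * M)."""
    m = len(pred)
    total = sum((pred[i][j] - true[i][j]) ** 2
                for i in range(m) for j in range(m))
    return total / (m * m)


# Three points on a line at 0, 1 and 3; only the pair (0, 2) is still unlabeled.
upp = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
low = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
true = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]

pred = predict_midpoint(upp, low)       # pred[0][2] is (3 + 1) / 2 = 2.0
error = mean_squared_error(pred, true)  # two entries are off by 1, so 2 / 9
```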

Average performance
To average the prediction error results over different datasets, the error is determined at predefined rounds, specific for each dataset. As discussed in Section 6.3, the total number of rounds is dependent on the size of the dataset. Thus, in round $i \cdot M_{\mathrm{MST}}$ with $i \in \{1, \ldots, 10\}$, the prediction error is determined. Averaging the results for a fixed $i$ produces the final score. Summarizing, all prediction errors of a single strategy at predefined rounds are averaged for all ten repetitions and all fourteen datasets.

Borda count
A drawback of the previous approach is that certain datasets might be harder to predict correctly, making these datasets influence the average performance heavily, as the prediction error is relatively large, and all datasets are weighted equally. Thus, Borda count [35] (a voting method) is used to rank the prediction error of each strategy in the following way. First, order all strategies based on the prediction error for each dataset and repetition. The strategy with the highest prediction error gets 1 point. The second worst gets 2 points. The third worst gets 3 points, and so on. This is done for each dataset and repetition in the predefined rounds $\{i \cdot M_{\mathrm{MST}}\}_{i = 1, \ldots, 10}$. The final Borda count results are obtained by averaging over all datasets and repetitions for a fixed round. A higher score indicates better performance, and the maximum possible score is equal to the total number of strategies.
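The scoring scheme can be illustrated with a short sketch. The function names, the strategy labels "A", "B", "C" and the error values are made up for the example. How ties in prediction error are handled is not specified above; this sketch simply follows the order produced by sorting, which is an assumption.

```python
def borda_points(errors):
    """Points for one dataset, repetition and round: the worst strategy gets 1."""
    ranked = sorted(errors, key=errors.get, reverse=True)  # highest error first
    return {name: points for points, name in enumerate(ranked, start=1)}


def average_borda(runs):
    """Average the points of every strategy over all runs."""
    totals = {}
    for errors in runs:
        for name, points in borda_points(errors).items():
            totals[name] = totals.get(name, 0) + points
    return {name: total / len(runs) for name, total in totals.items()}


# Two hypothetical runs with the prediction errors of three strategies.
runs = [
    {"A": 0.3, "B": 0.1, "C": 0.2},  # points: A 1, C 2, B 3
    {"A": 0.1, "B": 0.2, "C": 0.4},  # points: C 1, B 2, A 3
]
scores = average_borda(runs)  # {"A": 2.0, "B": 2.5, "C": 1.5}
```

Because only the rank within each run counts, a dataset with large absolute errors does not dominate the final score.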

Area under the curve
Instead of comparing the results at specified iterations, it is also possible to evaluate the performance of a strategy by measuring the so-called area under the curve (AUC) over all iterations using the trapezoidal rule. For each strategy, dataset and repetition, the area under the prediction error is measured up to and including the maximum number of rounds ($10 \cdot M_{\mathrm{MST}}$). As the rounds are equally spaced, AUC reduces to

$$\mathrm{AUC} := \sum_{t=1}^{T-1} \frac{\varepsilon_t + \varepsilon_{t+1}}{2}, \tag{22}$$

where $\varepsilon_t$ denotes the prediction error in round $t$ and $T = 10 \cdot M_{\mathrm{MST}}$. Note that the AUC is not necessarily bounded by [0, 1].

Fig. 2 Example AUC: Each round the prediction error is measured. The AUC of the prediction error is then determined by using the trapezoidal rule (22), which adds the area of the golden rectangles. Note that this is exactly equal to the blue area under the prediction error curve. (Axes: round versus total prediction error.)
By averaging over the repetitions, an average AUC score can be derived for each strategy and dataset. A lower score indicates better performance, as the prediction error must be minimized and the sooner this is achieved the better. A fictitious example of how the AUC is measured can be seen in Fig. 2.
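For equally spaced rounds with unit spacing, the trapezoidal rule is a sum of averages of consecutive errors. A minimal sketch follows, with a hypothetical function name and made-up error values; it assumes the list holds one prediction error per round in chronological order.

```python
def auc_trapezoid(errors):
    """Area under the error curve for unit-spaced rounds (trapezoidal rule)."""
    return sum((a + b) / 2 for a, b in zip(errors, errors[1:]))


# A curve that drops quickly yields a smaller area than one that drops late.
fast = auc_trapezoid([4.0, 2.0, 1.0, 1.0])  # 3.0 + 1.5 + 1.0 = 5.5
slow = auc_trapezoid([4.0, 4.0, 4.0, 1.0])  # 4.0 + 4.0 + 2.5 = 10.5
```

Both toy curves end at the same error, yet the AUC rewards the strategy that reduces the error earlier.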

Results
Following the experimental setup from Section 6, all 53 strategies outlined in Section 5 are evaluated on fourteen different datasets (see Section 6.2). The results are summarized into three tables: Table 1 gives the average prediction error results (Section 6.4.1); Table 2 shows the average Borda count score for each strategy (Section 6.4.2); Table 3 displays the area under the curve results for each strategy and dataset (Section 6.4.3). These tables all provide a different angle on the performance of the selection strategies.
Next, we will discuss the most important observations backed by evidence from Tables 1, 2 and 3.
Observation 1 There are better strategies than simply choosing a random pair.

Evidence:
The best rank random pair achieves is 13th, in Table 3 on the dataset Unbalance. Often it ranks around the mid-twenties in Tables 1, 2 and 3. This means that there are (many) strategies that perform better than random pair.
Observation 2 Max total bound gap / max degree is the best strategy for earlier rounds.

Evidence:
The ranked scores of strategy max total bound gap / max degree are highlighted in Table 4. For rounds 2 · M_MST up to 7 · M_MST, the strategy ranks the best out of all evaluated strategies. When one has really limited labeling capabilities, this strategy performs very well across all datasets. It always has the best AUC score out of all tested strategies, except for the dataset Unbalance (see Table 3).
Observation 3 Max degree is generally a good criterion, especially in the earlier rounds.

Evidence:
In Tables 1, 2 and 3, a lot of green cells belong to a strategy with max degree. This means that it performs close to or equal to the best performance. Thus, it is a good strategy to choose at least one of the indices based on max degree, especially in the earlier rounds. In round 3 · M_MST, strategies with max degree rank in Table 1: (10th, 7th, 4th, 6th, 2nd, 5th, 13th, 9th, 12th, 11th, 3rd, 1st, 14th). Thus, the entire top 14 is filled by strategies with max degree except for the eighth place, which is obtained by max total degree. This criterion is thus highly effective in the earlier rounds.
Observation 4 Min exp. distance and median exp. distance are bad criteria.
Evidence: Both min exp. distance and median exp. distance perform terribly. After 10 · M_MST rounds, strategies with min exp. distance and with median exp. distance are ranked (23rd, 49th, 53rd, 51st, 50th, 52nd) and (16th, 48th, 41st, 47th, 46th, 42nd), respectively, in Table 1. Only combining with max degree can save the performance. For all other combinations, min exp. distance is colored red in Tables 1, 2 and 3, which means that its performance is (close to) the worst.
Observation 5 Although the prediction is directly dependent on the bound gap, max bound gap is only a good strategy after more than 7 · M_MST rounds.

Evidence:
The ranked scores of strategy max bound gap are highlighted in Table 5. In the early rounds (up to 4 · M_MST), this strategy performs even worse than random pair. After that, it quickly becomes one of the best performing strategies, even ranking first in the later rounds. Due to the slow start, the AUC scores are remarkably mediocre (see Table 3).
Observation 6 Max exp. distance is a late bloomer.
Evidence: Whilst min exp. distance and median exp. distance perform badly, max exp. distance gets increasingly

Table 2 Borda count: for each dataset and repetition, Borda count is used to rank the prediction error of the strategies and averaged in rounds i · M_MST with i ∈ {1, ..., 10}. The ranking (by column) of each Borda count score is noted in brackets. Coloring of each column is done linearly between the worst and baseline (random pair) score and linearly between the baseline (random pair) and the best score.

Table 5 Highlighted ranks: the ranks of the strategy max bound gap from Tables 1 and 2.

Evidence: In Table 3, every strategy has approximately the same color across datasets. This means that the relative performance is not very dependent on the dataset. However, Unbalance gives the most deviant results. This implies that the balancedness of the dataset could influence the performance of a strategy.

Real world experiment
In order to test if the observations also hold for real world datasets, we also evaluate the strategies on the cifar10 [36] and mnist [37] datasets. These datasets consist of images of ten different categories. To limit memory space and running time, we only take the first 1,000 samples of the training set for each dataset. The distance between two images is determined by the Euclidean norm, which was also used in the previous experiments. The results can be found in Table 6, where the average performance is given (see Section 6.4.1). Next, we discuss (using Table 6) if the observations from Section 7 also hold for these real world datasets.

There are still many better strategies than simply choosing a random pair (Observation 1). Max total bound gap / max degree also remains the best strategy for earlier rounds (Observation 2), but now the performance falls off after 2 · M_MST rounds. Max degree is generally a good criterion (Observation 3): the best strategies often use it. Min exp. distance and median exp. distance are still bad criteria (Observation 4). But now, max bound gap is not a good strategy even after more than 7 · M_MST rounds (Observation 5). After 10 · M_MST rounds, it ranks 30th, whilst simply selecting a random pair ranks 20th. Perhaps this strategy needs even more rounds to become good. Max exp. distance is also no longer a late bloomer (Observation 6), as multiple strategies with this criterion receive a higher (worse) rank after 10 · M_MST rounds than after 5 · M_MST rounds. Furthermore, it ranks worse after 10 · M_MST rounds compared with the previous experiment. Perhaps this strategy also needs more rounds to start blooming.

We believe that the difference could be explained by the dimensionality of the datasets. The cifar10 and mnist datasets have a higher dimensionality (32 × 32 × 3 and 28 × 28, respectively). It is well-known that in higher-dimensional space, most points will be far away from each other. Therefore, dimensionality could play a role in the distribution of pairwise distances. This, in turn, could have an effect on some strategies such as max bound gap and max exp. distance, which is why we believe that these strategies may need more time to start performing well on these datasets. The AUC performance remains relatively stable for these datasets (Observation 7).
In general, most previous observations still hold for these real world datasets. Only some strategies that previously performed well in the later rounds did not improve as much on these datasets. It could be that more rounds are necessary.
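The effect of dimensionality on the distribution of pairwise distances can be illustrated with a small simulation. This is only a sketch on synthetic data: the points are drawn uniformly at random from the unit hypercube (not from cifar10 or mnist), and the function name, sample size and seed are arbitrary choices.

```python
import math
import random
import statistics


def relative_spread(dim, n_points=40, seed=0):
    """Standard deviation of all pairwise Euclidean distances divided by their mean.

    Points are sampled uniformly from the unit hypercube of the given dimension.
    A small value means that all pairwise distances are nearly equal.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    distances = [
        math.dist(p, q) for idx, p in enumerate(points) for q in points[idx + 1:]
    ]
    return statistics.pstdev(distances) / statistics.mean(distances)


spread_2d = relative_spread(dim=2)          # two-dimensional data
spread_784d = relative_spread(dim=28 * 28)  # same dimensionality as an mnist image
```

In this synthetic setting the relative spread is much smaller in 784 dimensions than in two, so nearly all pairs are similarly far apart. If real image data behaves in a similar way, criteria that rely on differences between expected distances or bound gaps would have less to distinguish pairs by.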

Performance max degree
An important observation from both Section 7 and Table 6 is that max degree is a good criterion. The best performing methods often include this criterion. We briefly want to discuss why we believe that choosing a sample that has already been chosen often (max degree) is beneficial. In order to predict the actual distance, a lower and an upper bound are established using the triangle inequality (Section 4.2). When the distance between i and j is labeled, the triangle inequality can be used to derive information about the distance between i and k if the distance between j and k is known. Therefore, labeling a sample with the highest degree yields many possible triangle inequality combinations, which could provide much information. This is why we believe that this criterion performs really well.
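The reasoning above can be made concrete with a small sketch. It is a simplified illustration, not the update rule of Algorithm 2: it only combines pairs of directly labeled distances through one shared sample, and the function name, the data layout and the initial upper bound `d_max` are assumptions made for this example.

```python
from itertools import combinations


def triangle_bounds(labeled, n, d_max):
    """Lower and upper bounds for all pairs among n samples.

    labeled: dict {(i, j): distance} with i < j, the distances given by the expert.
    d_max:   assumed initial upper bound on any distance.
    For an unlabeled pair (i, k) and every sample j with both d(i, j) and d(j, k)
    labeled, the triangle inequality gives
        |d(i, j) - d(j, k)| <= d(i, k) <= d(i, j) + d(j, k).
    """
    def known(a, b):
        return labeled.get((min(a, b), max(a, b)))

    lower, upper = {}, {}
    for i, k in combinations(range(n), 2):
        direct = known(i, k)
        if direct is not None:  # labeled pairs are known exactly
            lower[i, k] = upper[i, k] = direct
            continue
        lo, hi = 0.0, d_max
        for j in range(n):
            if j == i or j == k:
                continue
            d_ij, d_jk = known(i, j), known(j, k)
            if d_ij is not None and d_jk is not None:
                lo = max(lo, abs(d_ij - d_jk))
                hi = min(hi, d_ij + d_jk)
        lower[i, k], upper[i, k] = lo, hi
    return lower, upper


# Sample 1 has degree 2: it is part of both labeled pairs.
lower, upper = triangle_bounds({(0, 1): 3.0, (1, 2): 1.0}, n=3, d_max=10.0)
```

In this toy example, the two labels that share sample 1 already narrow the unlabeled pair (0, 2) to the interval [2, 4]. Every additional label involving a high-degree sample creates one such triangle per existing neighbor, which matches the intuition given above.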

Discussion and future research
This research can be viewed as a pioneering contribution and is a significant first step in APDL. Below we elaborate on both the shortcomings of the proposed approach and the related challenges for further research.

Perfect expert
It is assumed that the expert does not make any mistake in determining the distance between two instances. This is a common, yet unreasonably optimistic, assumption in AL research. Settles [38] states that "we have often assumed that there is a single infallible annotator whose labels can be trusted" and views this assumption as one of the six practical challenges for AL. How to deal with a noisy expert remains a critical research problem. A way of mitigating the mistakes of the expert in APDL is to allow some ε-boundary around the labels and to incorporate this into the approximation bounds.

Table 6 Average performance (cifar10 & mnist): for each dataset (cifar10 & mnist) and repetition, the prediction error of a strategy is averaged in rounds i · M_MST with i ∈ {1, ..., 10}. The ranking (by column) of each average prediction error is noted in brackets. Coloring of each column is done linearly between the worst and baseline (random pair) score and linearly between the baseline (random pair) and the best score.

Still, there are many more ways to deal with an imperfect expert, which should be investigated. Using properties of a metric, mistakes can be spotted and reevaluated.
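One possible way to spot mistakes with metric properties is to look for labeled triangles that violate the triangle inequality. The sketch below is an assumption about how such a check could look, not a method from this paper. The function name is hypothetical, and `eps` is a hypothetical tolerance for small labeling errors.

```python
from itertools import combinations


def triangle_violations(labeled, eps=0.0):
    """Return all fully labeled triples (i, j, k) that violate the triangle inequality.

    labeled: dict {(i, j): distance} with i < j.
    eps:     hypothetical tolerance; a triple is only reported if its longest side
             exceeds the sum of the other two sides by more than eps.
    """
    samples = sorted({s for pair in labeled for s in pair})

    def known(a, b):
        return labeled.get((min(a, b), max(a, b)))

    suspicious = []
    for i, j, k in combinations(samples, 3):
        sides = [known(i, j), known(j, k), known(i, k)]
        if None in sides:  # not all three distances have been labeled
            continue
        longest = max(sides)
        if longest > sum(sides) - longest + eps:
            suspicious.append((i, j, k))
    return suspicious


# The label for (0, 2) cannot be consistent with the other two labels.
noisy_labels = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 5.0}
flagged = triangle_violations(noisy_labels)
```

A flagged triple only shows that at least one of its three labels is wrong. Deciding which pair to present to the expert again remains an open design choice.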

Underlying distance metric
In all experiments, the Euclidean distance was used as underlying distance metric. This might affect the conclusions that were drawn, as alternative distance metrics might be favorable for different strategies. In future research, this could be investigated by changing the underlying distance metric and evaluating if the same strategies are always performing the best.

Complex strategies
In our research, we have examined many selection algorithms based on straightforward criteria. Newer and more complex strategies could be developed, reducing the prediction error even more. Consider, for example, mixing strategies: start with a strategy that works well in the beginning (e.g., max total bound gap / max degree) and then switch to another strategy (e.g., max bound gap) that works better later on. Another option would be to select, in each round, a specific strategy with a certain probability. Additionally, transfer learning [5,6] can be applied to train an even more advanced model (e.g., a neural network) using labeled datasets. Such a model can be trained to choose a good strategy at a specific time, where the new prediction error can be used to either reward or penalize the selection. If the chosen strategy selected a pair that gave a lot of insight, the model can be updated to select this strategy more often in similar cases. When properly trained, the model could be applied to new datasets to determine the selection strategy. Whether this is a good approach depends on the ability of the model to transfer the learned information to the new dataset.
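The two mixing ideas mentioned above could be sketched as follows. This is a hypothetical design, not something evaluated in this research: a strategy is assumed to be any function that receives the current state and returns the pair to query, and the switch round and the probabilities are free parameters.

```python
import random


def switching_strategy(early, late, switch_round):
    """Use the strategy `early` before `switch_round` and `late` from then on."""
    def select(round_idx, state):
        chosen = early if round_idx < switch_round else late
        return chosen(state)
    return select


def randomized_strategy(strategies, weights, rng=None):
    """Each round, pick one of the strategies with probability proportional to its weight."""
    rng = rng or random.Random()

    def select(round_idx, state):
        chosen = rng.choices(strategies, weights=weights, k=1)[0]
        return chosen(state)
    return select


# Stand-in strategies that only return a label, so that the sketch is runnable.
def early_strategy(state):
    return "pair chosen by the early strategy"


def late_strategy(state):
    return "pair chosen by the late strategy"


mixed = switching_strategy(early_strategy, late_strategy, switch_round=7)
first_query = mixed(0, state=None)
later_query = mixed(7, state=None)
```

In practice, the switch round would have to be expressed relative to the dataset, for example as a multiple of M_MST, and tuned on datasets for which all distances are known.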

Running time
In this research, we have used straightforward criteria that are easy to compute. However, when more complex strategies are designed, running time could start to play a role. The importance of running time is mostly task dependent. The cost of coming up with the next query should be balanced with the cost of the labeling done by the expert. We consider APDL to be particularly useful in situations where the expert can only be queried a limited number of times (due to high costs). However, running time is something that should be considered in future work when more complex strategies are used. When a strategy is too hard to compute, approximation algorithms could be developed.

The average running time of each strategy can be seen in Table 7. We believe that the difference in running time can mostly be explained by the following phenomenon. When multiple samples satisfy the selection criterion, a random selection is made between these samples. This selection takes more time when there are more samples to choose from. Consider, for example, the difference between random index / max degree and random index / min degree, which take on average 685 and 978 seconds, respectively. There are considerably more samples that share the minimum degree than samples that share the maximum degree. In Table 7, we observe that strategies are consistently slower when they have more samples that satisfy the criterion.

Space complexity
In the experiments, at most M = 1,000 samples were used, as this already leads to 499,500 different pairs. To store the approximation bounds for each pair, O(M^2) memory is necessary. This can quickly become infeasible for large M. Although rather time expensive, these approximation bounds could be calculated every time they are needed. Yet, for large problems, a better solution is necessary. A major insight of this research is that choosing based on max degree consistently performs well. This criterion does not use any information from the approximation bounds, which is why it is ideal for large problems: the approximation bounds are only necessary for the final predictions. More research is necessary to optimize large APDL problems.
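The numbers above follow from counting unordered pairs: M(M − 1)/2 = 1,000 · 999 / 2 = 499,500. The sketch below shows this count, one common way to store a value per pair in a flat array, and the fact that max degree only needs one counter per sample. All function names are hypothetical and the storage layout is an assumption, not the implementation used in the experiments.

```python
def n_pairs(m):
    """Number of unordered pairs among m samples."""
    return m * (m - 1) // 2


def pair_index(i, j, m):
    """Position of the pair {i, j} in a flat array of length n_pairs(m).

    Pairs are ordered as (0, 1), (0, 2), ..., (0, m - 1), (1, 2), ...
    """
    if i > j:
        i, j = j, i
    return i * m - i * (i + 1) // 2 + (j - i - 1)


def max_degree_candidates(degree):
    """Samples with the highest degree; needs only O(M) memory, no bounds."""
    top = max(degree)
    return [sample for sample, d in enumerate(degree) if d == top]


total_pairs = n_pairs(1000)
last_position = pair_index(998, 999, 1000)
```

Storing a lower and an upper bound for every pair thus takes two arrays of 499,500 values for M = 1,000, and the size grows quadratically in M, whereas the degree counters grow only linearly.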

Using feature values
It was assumed in Section 2.2 that no feature values should be used. In this way, the observations from this research are not dependent on the application domain. Furthermore, if new methods are developed that do use feature values, our tested selection strategies can function as a good baseline. Adding information (using the feature values) should only increase the performance of an APDL method. Thus, when a model is performing worse than any one of our suggested strategies, it should be considered as a major warning sign. Additionally, during the APDL process, a model could be used to evaluate if the feature values could help the prediction. If so, feature values could be introduced into the query selection after some rounds.

Gaining insight
Demystifying AL can give us critical insights. Which samples are useful to query? Can we understand why? Can we explain why certain selection algorithms perform better? Is the clusteredness/balancedness of a dataset relevant? Are there better indicators for the usefulness of a sample query? Answering these kinds of questions could lead to better performing models.

Error reduction rate
The reduction rate in prediction error opens up many exciting research opportunities. Can guarantees be derived about the speed with which the prediction error converges for certain strategies? It would be especially useful for practical applications to know how many labels should be gathered to get at most a prediction error of δ > 0. To derive such a guarantee, either theoretical proof or substantial numerical evidence is necessary. Additionally, the effect of a tight or loose initial upper bound for the maximum distance on the convergence speed could also be investigated.

Additional application
We think that APDL can also be used to determine the complexity of a dataset. When a strategy needs more rounds to attain a certain prediction error, the dataset might be more complex, as it is harder to learn the pairwise distances. In this way, APDL can even be useful for fully labeled datasets. Which strategies to use and how complexity is exactly quantified with APDL are all interesting subjects for future research.

Prediction model
Recall that there are two critical components in APDL, namely 'Which pair is queried each round?' and 'How to use this information to make the best prediction?' The focus of our research was to answer the first question. To make a prediction of a distance, we used the upper and lower bound approximation and took the average as prediction (see Definition 4). Therein lies a large opportunity for improvement, as a more advanced prediction model could improve the final prediction as well as the query selection. Using a tuned weighted average of the upper and lower approximation could already perform better.
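A weighted average of the bounds could be sketched as follows. The weight `w`, the grid search and the use of a few pairs with known distances for tuning are assumptions made for this illustration. With `w = 0.5`, the prediction is the plain average of the two bounds.

```python
def predict(lower, upper, w=0.5):
    """Weighted average of the bounds; w = 0.5 gives the plain average."""
    return (1 - w) * lower + w * upper


def tune_weight(bounds, truths, grid=None):
    """Pick the weight from a grid that minimizes the total absolute error.

    bounds: list of (lower, upper) tuples for pairs with a known distance.
    truths: the known distances of those pairs (assumed to be available,
            e.g., from pairs that were labeled but held out for tuning).
    """
    grid = grid or [step / 10 for step in range(11)]

    def total_error(w):
        return sum(abs(predict(lo, hi, w) - t) for (lo, hi), t in zip(bounds, truths))

    return min(grid, key=total_error)


# Illustrative (made-up) bounds whose true distances lie closer to the lower bound.
best_w = tune_weight(bounds=[(0.0, 10.0), (2.0, 6.0)], truths=[3.0, 3.2])
midpoint = predict(2.0, 6.0)
```

Whether a single weight transfers between rounds or datasets is an open question, since the bounds tighten as more pairs are labeled.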

Summary
We started by introducing the problem of APDL, where the goal is to actively learn the pairwise distances between all instances. We established upper and lower bound approximations using properties of a distance function. Furthermore, we presented an update rule that automatically updates the upper and lower bounds using the newest labeled distance. Then, we provided fourteen selection criteria, which gave us 53 query strategies combined. These strategies do not use feature values, making the observations from the experiments domain-independent. This makes these selection strategies ideal candidates for a baseline in future research.
The experiments led to valuable new insights. These observations were tested by evaluating all strategies on two real world datasets (cifar10 & mnist). We found multiple strategies that perform better than simply randomly selecting a pair (Observation 1). This shows that it is indeed possible to 'smartly' select the indices. We determined that the performance of the strategies was not very dependent on the datasets (Observation 7). The performance only changed somewhat in a highly unbalanced case. We identified max degree to be a consistently good criterion. In Section 8.1, we explained why we believe that this criterion is useful. We also discovered which strategies should not be chosen due to generally bad performance (Observation 4). Choosing the right selection strategy could potentially save many hours and resources. The findings from the experiments are not dependent on the dimensionality of the data or (noisy) feature values, as feature values were not taken into account. However, more dimensions could lead to higher sparsity (curse of dimensionality), which is why a mix of sparse and dense datasets was used.

Algorithm 2
Update upper and lower bounds.

Fig. 1
Visualization datasets: each two-dimensional dataset that is used to test different strategies.

3 Max combined total bound gap
First, determine for each sample the bound gap with all other samples and sum these into a combined bound gap. Then, select a pair uniformly at random out of the remaining pairs with the largest sum of combined bound gaps.

Table 1
Average performance: for each dataset and repetition, the prediction error of a strategy is averaged in rounds i · M_MST with i ∈ {1, ..., 10}. The ranking (by column) of each average prediction error is noted in brackets. Coloring of each column is done linearly between the worst and baseline (random pair) score and linearly between the baseline (random pair) and the best score.

Table 7
Average running time: the running time of each strategy averaged over all repetitions and datasets (including cifar10 & mnist).The ranking is noted in brackets.Coloring is done linearly between the worst and best score