We start with an artificial data experiment, used to demonstrate the characteristics of the solution. In particular, we show how the two alternative inference strategies have very different strengths and weaknesses. We also demonstrate empirically the quality of the approximation made by the Gibbs-hard sampler.
After the demonstration, we compare the proposed methods with earlier matching problem solutions, using data collections analyzed by the earlier authors. We perform three different comparisons. The first is an image matching task from Quadrianto et al. (2010), the second a metabolite matching task from Tripathi et al. (2011), and the last a cross-lingual document alignment task from Djuric et al. (2012). In all cases we compare the proposed methods with the leading kernelized sorting variant CKS by Djuric et al. (2012) and with the CCA-ML method by Tripathi et al. (2011), which corresponds to finding the maximum likelihood solution of our model. The purpose of these comparisons is to show that the proposed solution is more accurate than the earlier solutions, while also demonstrating that the improvement comes from the Bayesian treatment of the model.
In this section we apply the model to simple artificial data sets of varying dimensionality, to illustrate an important property of the inference strategies. We generated data sets with N=40 samples and D ranging from 10 to 640, by sampling data from the model (2) with four latent variables. We then applied all model variants to these matching problems, initializing them with a permutation that has 50 % correct matches to simulate a reasonably good starting point. The resulting accuracies, averaged over 20 different data sets of each size, are shown in Fig. 2 (top) and display an interesting trend: for low dimensionality the VB algorithms are the best, but for high dimensionality Gibbs-hard is clearly superior.
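The data-generation setup can be sketched as follows. Since the exact form of model (2) is not reproduced here, this assumes a generic linear-Gaussian latent-variable model with shared latent variables; `generate_matching_data` and its noise level are illustrative choices, not the original code:

```python
import numpy as np

def generate_matching_data(N=40, D=160, K=4, noise=0.1, seed=0):
    """Sample two views from a shared linear-Gaussian latent model,
    then permute the second view to create a matching problem."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((N, K))            # shared latent variables
    Wx = rng.standard_normal((K, D))           # projection for view X
    Wy = rng.standard_normal((K, D))           # projection for view Y
    X = Z @ Wx + noise * rng.standard_normal((N, D))
    Y = Z @ Wy + noise * rng.standard_normal((N, D))
    pi = rng.permutation(N)                    # ground-truth permutation
    return X, Y[pi], pi

X, Y, pi = generate_matching_data()
```

Varying `D` while keeping `N` and `K` fixed reproduces the experimental dimension sweep described above.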
The reason for this lies in the shape of the posterior distribution over the permutations: for high-dimensional data it is peaked around the best permutation, whereas for low-dimensional data it is extremely wide; nearly all permutations are plausible. This is illustrated in Fig. 2 (bottom), which shows for each of the dimensionalities the probability of the most likely permutation, computed for data with N=8; for such a small set we can explicitly enumerate all of the 40,320 possible permutations and hence compute the actual normalized probability. We see that when the dimensionality grows, the probability assigned to the most likely permutation gets larger. For low dimensionality the posterior is very flat, but already for D=120 it is so peaked that the approximations of Gibbs-hard, VB-numInt and VB-local become accurate.
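For such a small N the exact posterior over permutations can be computed by brute-force enumeration. The sketch below substitutes a simple Gaussian observation score for the full model likelihood; `exact_permutation_posterior` and `sigma` are illustrative names, not part of the original method:

```python
import itertools
import numpy as np

def exact_permutation_posterior(X, Y, sigma=1.0):
    """Exact normalized posterior over all N! permutations of Y under a
    simple Gaussian score: log p(pi) proportional to -||X - Y[pi]||^2 / (2 sigma^2).
    Feasible only for tiny N (8! = 40,320 permutations)."""
    N = X.shape[0]
    perms = list(itertools.permutations(range(N)))
    logp = np.array([-((X - Y[list(p)]) ** 2).sum() / (2 * sigma ** 2)
                     for p in perms])
    logp -= logp.max()                  # stabilize before exponentiating
    p = np.exp(logp)
    return perms, p / p.sum()
```

The probability of the most likely permutation, `p.max()`, is the quantity plotted in Fig. 2 (bottom).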
Next, we explain why the variational approximation and the Gibbs sampler behave so differently in these two scenarios. Let us consider the high-dimensional case first. Gibbs-hard does well because the assumption that p(π|Ξ,rest) corresponds to the most likely permutation is good, yet the sampler still explores the space of permutations effectively because Ξ is resampled every time. Gibbs-subset shows a similar trend of improved accuracy for higher dimensions, but it is considerably less efficient in exploring the posterior since it does not find the best permutation for each sample but instead produces permutations with a high degree of autocorrelation. This also results in clearly lower accuracy. The variational approximation, on the other hand, is a mean-field algorithm that explicitly averages over the permutations and the latent variables. Hence, for a peaked posterior VB-local and VB-hard quickly get stuck with one permutation and the model converges to a local optimum. The VB-numInt model does better because it borrows strength from the Gibbs sampler; it numerically integrates over Ξ when updating the permutation, and hence can explore the space of permutations to some degree.
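The core of the Gibbs-hard update, selecting the single most likely permutation given the current latent representations of the two views, can be sketched as a linear assignment problem. The squared-distance cost used here is an assumption standing in for the model's actual conditional density:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gibbs_hard_step(Zx, Zy):
    """Pick the most likely permutation given current representations of
    the two views, by solving a linear assignment problem (Hungarian
    algorithm) over pairwise squared distances."""
    # cost[i, j] = squared distance between sample i of X and sample j of Y
    cost = ((Zx[:, None, :] - Zy[None, :, :]) ** 2).sum(-1)
    _, cols = linear_sum_assignment(cost)
    return cols
```

Resampling Ξ between such steps is what lets the sampler keep moving through the permutation space instead of freezing at one assignment.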
For the low-dimensional case the true posterior is very wide. Gibbs-hard no longer approximates the posterior well, but even ignoring this issue we face a more fundamental problem with both samplers: since the posterior over permutations is so wide, it becomes very difficult to estimate the rest of the parameters well. When the sampler, correctly, changes the permutation dramatically from one sample to the next, it becomes nearly impossible for W and the other parameters to converge towards a reasonable posterior. The variational approximation, however, is inefficient in exploring this wide posterior since it averages over the possible values. Hence, the inference technique acts as a strong regularizer, making it possible to infer the rest of the parameters even though the true posterior over the matches is very wide.
Finally, we illustrate empirically the approximation error caused by the incorrect acceptance probability for the (π,Ξ) proposals in Gibbs-hard. Using a data set with N=8 samples we ran both an exact Gibbs sampler and the proposed algorithm for 10,000 independent samples and estimated the marginal posterior p(π|rest) based on the posterior samples, keeping the rest of the parameters, except π and Ξ, fixed. Figure 3 cross-plots the log-probabilities of the two distributions for various data dimensionalities. These plots suggest that despite making a seemingly crude approximation, the Gibbs-hard sampler still produces samples from almost the correct distribution, especially for high-dimensional data. It assigns somewhat too high a probability to the most likely permutation, but it still gives non-zero probability mass to almost all of the same permutations as the exact sampler, and it also retains the relative probabilities of the permutations accurately.
In this problem the task is to match two halves of a set of 320 images, using the raw pixel values (40×40 pixels in Lab color space) as the input. The problem itself is completely artificial, but it has nevertheless become a kind of benchmark for matching solutions due to the data provided by Quadrianto et al. (2010). The data has 2400 dimensions, and hence constitutes an example of high-dimensional data for which the sampling algorithms should do well. VB-local, on the other hand, would not notably differ from VB-hard since the posterior is so peaked around the best permutation, and hence we leave it out of the comparison.
We solve the matching problem with varying subsets of the data. For VB-hard and VB-numInt, we learn L=50 different initial models for each choice of N and initialize the final model by the consensus of these matches, using K=8 components to keep the computational cost manageable. We then initialize the Gibbs variants with the result of VB-numInt, and use K=16 components. We ran 10 parallel chains of the samplers, for 500 samples each, and then found the consensus of all posterior samples; since the initialization was already a good one we did not discard a burn-in period. For Gibbs-subset we used J=4 and 100 subset choices for each posterior sample.
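The consensus of several permutations can be formed by counting pair co-occurrences across runs or posterior samples and solving one more assignment problem. A minimal sketch, with `consensus_permutation` as a hypothetical helper name:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def consensus_permutation(samples):
    """Consensus of several sampled permutations: count how often each
    pair (i, j) occurs across the samples, then pick the permutation
    maximizing the total count via linear assignment."""
    N = len(samples[0])
    counts = np.zeros((N, N))
    for pi in samples:
        counts[np.arange(N), np.asarray(pi)] += 1
    _, cols = linear_sum_assignment(-counts)   # negate to maximize counts
    return cols
```

The same routine serves both for combining the L=50 preliminary VB runs and for summarizing the pooled posterior samples of the Gibbs chains.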
Figure 4 compares the proposed methods with the p-smooth variant of kernelized sorting by Jagarlamudi et al. (2010), convex kernelized sorting by Djuric et al. (2012), and least-squares object matching (LSOM) by Yamada and Sugiyama (2011), all of which have been demonstrated to be superior to the original kernelized sorting algorithm by Quadrianto et al. (2010). In addition, we compare the proposed methods with CCA-ML, which corresponds to using a (hard) EM algorithm to find the maximum likelihood solution of our model. For CCA-ML we used an initialization strategy similar to what was used for the proposed methods. That is, we ran the model L=50 times with different initializations that were slightly randomly permuted PCA initializations. The final accuracy is the accuracy of the consensus; we also tried running the model one more time using the consensus as initialization, but this typically decreased the accuracy. To avoid overfitting to the high-dimensional data, the CCA-ML method was run on the first N/8 PCA components of each data set.
The main finding is that Gibbs-hard is the best matching solution for this data, followed by LSOM. For the whole collection with N=320 images Gibbs-hard gets 275 correct matches compared to 136 for p-smooth and 206 for CKS; Yamada and Sugiyama (2011) do not report an exact number for LSOM, but extrapolation suggests it would find roughly 245 correct pairs. The variational Bayesian inference is also good as long as we use numerical integration for estimating q(π); it reaches accuracy comparable to CKS while clearly outperforming p-smooth for large sample sizes. The initialization scheme is necessary for achieving this; the individual runs used for finding the initialization found on average fewer than 30 correct pairs for N=320, whereas the final run initialized with their consensus reached 201.
Gibbs-subset, which was initialized with the output of VB-numInt, produces effectively the same results as its initialization; this confirms that the sampler is too inefficient in exploring the permutation space. Considerably more samples would be required to improve the results, but since Gibbs-hard works so much better we did not spend the extra computation to do so. We also see that the VB-hard variant that only uses the most likely solution is not sufficient here. This reveals that the good accuracy of Gibbs-hard and VB-numInt stems from the posterior inference over the permutations. However, for large N VB-hard still outperforms CCA-ML, demonstrating that Bayesian inference over the rest of the parameters already helps.
One of the advantages of the Bayesian matching solutions is that in addition to learning the best permutation we can characterize the posterior over the permutations. Convex kernelized sorting can also achieve this to some degree, since it optimizes the HSIC over doubly-stochastic matrices and hence produces soft assignments as a result. Next, we compare how well the two methods fare in terms of such soft assignments. First we look at the recall of the correct pairs, by ordering for each sample x the samples in Y according to the posterior probability of matching with x. Figure 5 (left) shows how already 95 % of the true pairs are captured within the top 5 ranks. For comparison, Djuric et al. (2012) report 81 % for the same threshold. We also inspected the actual probabilities to verify that the posterior is a reasonable distribution, and that the probabilities are consistent with the actual results. Figure 5 (right) plots the probabilities of the correct matches against the highest probability assigned to any pair. We see that the probabilities cover the whole range from roughly 0.1 to one, indicating that the algorithm is more certain of some pairs. We also note that it makes very few mistakes for the pairs it assigned a high probability, indicating that the values indeed correspond to reasonable probabilities. For comparison, CKS does not assign a weight higher than 0.2 to any pair, illustrating how the soft match learned by CKS cannot be interpreted as any kind of probability even though the weights sum to one for each sample; the distributions are clearly too wide to represent the true uncertainty.
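The recall statistic can be computed directly from an N×N matrix of posterior match probabilities, assuming the true match of sample i is sample i after reordering; `recall_at_k` is an illustrative helper, not from the original code:

```python
import numpy as np

def recall_at_k(P, k=5):
    """Fraction of true pairs (i, i) ranked within the top-k candidates
    of row i, given an N x N matrix P of posterior match probabilities."""
    N = P.shape[0]
    diag = P[np.arange(N), np.arange(N)][:, None]
    # rank of the true match = number of candidates with strictly
    # higher probability (rank 0 is the top candidate)
    ranks = (P > diag).sum(axis=1)
    return float((ranks < k).mean())
```

Evaluated at k=5, this is the 95 % figure reported for the proposed method above.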
Next we proceed to an example from translational medicine, taken from Tripathi et al. (2011), where the task is to match metabolites of two populations. The problem mimics a challenge where we need to align metabolites of two different species (Sysi-Aho et al. 2011), but here the two populations are both human to provide the ground-truth alignment. The data consists of time series of concentrations of N=53 metabolites, and we have measurements for several subjects. We compare our method with two methods presented by Tripathi et al. (2011), using a setup very similar to theirs. In particular, we average the matching accuracies over 100 runs where X and Y are taken from random subjects (that is, the runs are truly independent since the input data is different in each run), and we restrict the matches so that a metabolite can only be paired with another one in the same functional class (the classes are assumed known). We also provide another set of results without constraints, to demonstrate how well we can do without any prior information on the match.
The individual time series are of very low dimensionality, ranging from 3 to 30 depending on the subject. Hence, we apply only the variational approximation methods to this problem; the posterior over the permutations is so wide that the Gibbs sampler variants would not work at all. We then compare our method with CCA-ML and CKS.
Figure 6 shows how we again outperform the earlier methods. VB-numInt and VB-local have comparable accuracy, and both are better than CKS and CCA-ML. The comparison with CCA-ML, which corresponds to the maximum likelihood solution of the proposed model, confirms the findings of the image matching experiment (Sect. 9.2). The Bayesian solution is advantageous in two respects. First, the difference between VB-hard and CCA-ML comes solely from doing Bayesian inference over the CCA parameters, since these two models treat the permutations in identical fashion. More importantly, however, the difference between VB-hard and the other two variants reveals that already approximate Bayesian inference over the permutations improves the accuracy dramatically. For completeness, we also tried the Gibbs samplers for this task, but as expected they did not work; they result in posteriors that are only marginally better than random assignments.
Note that in this experiment we did not use the advanced initialization strategy of learning the final model given a consensus of preliminary runs, but instead used only one initialization (based on the first PCA component) for each run. However, we did one final test to mimic the consensus matching setup of Tripathi et al. (2011), and found the consensus of the 100 runs with different input matrices to reach 85 %, compared to their result of 70 % with an equal amount of data and some additional biological constraints not used in our solution.
As a third real data experiment we consider the task of document alignment. Given two collections of documents written in two different languages, the task is to find the translations by matching the documents. We use the data provided by Djuric et al. (2012), consisting of more than 300 documents extracted from the Europarl corpus and represented as TF-IDF vectors of words stemmed with Snowball. Djuric et al. (2012) considered nine different matching tasks, each between English documents and documents written in one of nine other languages, and showed that CKS outperforms other kernelized sorting algorithms (the original KS algorithm, KS p-smooth and LSOM) for all tasks by a wide margin. They also achieved effectively perfect accuracy for seven of the tasks, reaching at least 98 % accuracy for each. For the remaining two language pairs, English-Swedish and English-Finnish, their accuracy was only 29 % and 37 %, respectively.
We initialized the Bayesian matching solutions with the permutation learned by CKS and then applied VB-numInt and Gibbs-hard to the same matching tasks, using a data representation that kept the 10,000 words with the highest total TF-IDF weight over the corpus, separately for each language. For both methods we used K=16, and for Gibbs-hard we again ran 10 separate chains for 500 samples each. We also applied CCA-ML with the same initialization, using the D=50 first PCA components to represent the data. The results are summarized in Table 1, showing that the Bayesian matching solutions and CCA-ML retain the good accuracy for the language pairs CKS already solved adequately. For the two difficult language pairs all methods improve on the initialization, but Gibbs-hard is the only one that also solves those problems perfectly, reaching 100 % accuracy.
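The vocabulary selection described above can be sketched as follows, assuming a dense TF-IDF matrix with one column per word; `top_tfidf_columns` is a hypothetical helper name:

```python
import numpy as np

def top_tfidf_columns(tfidf, n_words=10000):
    """Keep the n_words columns (word features) with the highest total
    TF-IDF weight over the corpus, applied separately per language."""
    totals = np.asarray(tfidf.sum(axis=0)).ravel()
    keep = np.argsort(totals)[::-1][:n_words]   # highest totals first
    return tfidf[:, keep]
```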
Summary of the empirical experiments
Above we performed four separate experiments to evaluate the Bayesian matching solutions. Based on both the artificial and real matching experiments we can draw the following conclusions:
The proposed Bayesian matching solution outperforms the comparison methods, including kernelized sorting variants and earlier methods based on CCA. In particular, it is considerably more accurate than the maximum-likelihood solutions based on the same idea of introducing a permutation matrix as part of CCA (Haghighi et al. 2008; Tripathi et al. 2011). This confirms that the improved accuracy stems from the full posterior inference, rather than from the model structure or cost function.
For high-dimensional data Gibbs-hard is the best method. It can explore the posterior space more efficiently than the variational approximation, and it produces interpretable posterior estimates with high matching accuracy. While the conditional density used for sampling the permutation is not necessarily exact, the choice of always picking the best permutation is extremely efficient compared to more justified alternatives. As illustrated in Fig. 3, it is still very accurate in producing samples from the correct posterior.
For low-dimensional data the true posterior over the permutations is so wide that properly modeling it does not produce good results. Hence, the Gibbs samplers do not work for such data. The variational approximations still provide accurate matches due to the inherent regularization effect of mean-field approximation.
All of the proposed methods depend heavily on the initialization. A good initialization can be obtained by finding a consensus of several matches. Alternatively, the methods can be initialized by the result of the convex kernelized sorting method by Djuric et al. (2012); it finds the global optimum of a relaxation of the kernelized sorting problem and produces good matching accuracy.
The practical suggestion based on these observations is to use the Gibbs-hard method for learning the matching solutions, assuming the data dimensionality is sufficiently high (at least tens, preferably hundreds or more). The method should be initialized either with a consensus learned from multiple random initializations, or with the CKS method. The consensus is best learned with the VB variants, since the samplers might have difficulties with initial solutions where almost all pairs are incorrect; then the posterior is wide irrespective of the dimensionality since most BCCA components do not describe relationships between the two sets. For low-dimensional data, we suggest using the VB-numInt method instead of the samplers.