## Introduction

The task in object matching is to learn the correspondence between samples in two data sets. A classical example considers a set of agents and a set of jobs, and the task is to assign each job to exactly one agent. For each agent-job pair we have a specific cost, corresponding, for example, to how well the agent performs the job or how much it costs, and the goal is to find the assignment that minimizes (or maximizes) the total assignment or matching cost. The problem is known as maximum weight matching in a bipartite graph, or as the linear assignment problem. Efficient polynomial-time algorithms exist for finding the match, such as the classical Hungarian algorithm (Kuhn 1955).

The classical setup takes the costs for the agent-job assignments as input. This is equivalent to assuming that we are given a distance measure between the two sets, and the goal is to minimize the total distance. In this work we study more complicated setups where no such distance is known. Instead, we are merely given two vector data sets representing the items in the two sets, with no known relationship between the feature representations. In the classical example of agent-job matching, we might have for each agent a set of features describing their physical abilities and test scores, whereas for the jobs we could have the text of the job advert. The task is still to assign each job to one of the agents, but the standard algorithms for solving the assignment problem do not apply because no cost matrix is available. To highlight the notion of possibly wildly different feature representations, Yamada and Sugiyama (2011) used the phrase cross-domain object matching (CDOM) to denote the problem.

Since no distance between the sets is given, we must replace the task of minimizing the total distance by something else. In recent years, a number of solutions have been proposed based on maximization of statistical dependency, measured for example by the mutual information, between the two sets. This corresponds to making an assumption that the correct assignment is the one that reveals the highest degree of statistical dependency between the sets. One intuitive justification for the idea is the observation that randomly permuting samples of the correct match will decrease the dependency, eventually making the two sets independent if all pairs are replaced with random ones. To our knowledge, the idea of finding the match that maximizes the statistical dependency was independently first suggested by Haghighi et al. (2008), Tripathi et al. (2009), and Quadrianto et al. (2009). The first two measure the dependency with linear canonical correlations, whereas the last one uses the kernel-based Hilbert-Schmidt Independence Criterion (HSIC; Smola et al. 2007), but conceptually the methods are closely related. Later for example Tripathi et al. (2010), Tripathi et al. (2011), Jagarlamudi et al. (2010), Quadrianto et al. (2010), Yamada and Sugiyama (2011) and Djuric et al. (2012) have presented improved models and extensions based on the same idea.

An alternative criterion for learning the match comes from joint modeling. Given the two sets, we can write a joint generative model for both, including a permutation over the samples of one of the sets as part of the model. Then a reasonable assumption is that the correct match is obtained with the permutation that results in the best joint model. This idea was proposed before the dependency-maximization solutions by Jebara (2004) who maximized the likelihood of a joint Gaussian model. Another example of a method optimizing the joint likelihood is the matching canonical correlation analysis (MCCA) by Haghighi et al. (2008), which was already mentioned in the previous paragraph; it can be seen as a method that both maximizes the canonical correlation between the two sets but also maximizes the likelihood of a specific probabilistic model, namely the probabilistic interpretation of canonical correlation analysis (CCA; Bach and Jordan 2005).

In this work we present another solution that fits both motivations, extending our preliminary publication (Klami 2012). We solve the matching problem by building an optimal joint model, but instead of maximizing the likelihood we do full Bayesian inference, searching for a posterior distribution over the permutations to characterize the set of possible matches. The connection to dependency-maximizing solutions is obtained by choosing Bayesian canonical correlation analysis (BCCA; Klami et al. 2013) as the underlying probabilistic model. Since the model implements CCA, our solution will also maximize the canonical correlation between the sets. Furthermore, it corresponds to the Bayesian solution of MCCA.

The match is learned by introducing a permutation parameter $$\boldsymbol{\pi}$$, which is an N×N binary matrix with unit row and column sums, into the BCCA model. The main challenge in Bayesian analysis of the model is then in learning the posterior over the permutations. While the set of permutation matrices is discrete, there are N! different permutations. Exact inference over such a huge space is not feasible, and hence the main contributions of this work are several alternative approximative strategies that enable simultaneous posterior inference over the permutations and the rest of the BCCA parameters. In particular, we will present both a Gibbs sampler with approximative and exact conditional distributions for the permutations given the rest of the parameters, as well as variational approximations with varying degrees of accuracy for a term approximating the posterior over the permutations. The former is a novel contribution of this work, whereas the latter was already presented in preliminary form by Klami (2012).

To empirically evaluate the various alternative approximations, we compare the proposed method with the best existing algorithms using three benchmark data sets: matching left and right sides of images based on their content, matching metabolic profiles of different individuals, and cross-lingual document alignment. For all experiments we compare the proposed methods with the leading kernelized sorting variant of convex kernelized sorting (CKS) by Djuric et al. (2012) and the maximum likelihood solution of our model, which corresponds to Haghighi et al. (2008) and Tripathi et al. (2011). In all three tasks the proposed methods outperform the earlier methods.

We will start by formally introducing the matching problem and our Bayesian formulation for it. We then summarize the Bayesian canonical correlation analysis (BCCA) model as presented by Klami et al. (2013), which is used as the underlying latent variable model in our matching solutions. We go through both the sampling and variational inference for that model, and then proceed to the main contributions of this work: The matching Bayesian CCA. Again we cover both the sampling and variational inference, before explaining related work and the empirical comparisons.

## Object matching

### Problem formulation

Given two sets of N objects, denoted by $$\mathcal{X}$$ and $$\mathcal{Y}$$, the goal is to discover a permutation matrix $$\boldsymbol{\pi}$$ over the objects in $$\mathcal{Y}$$ such that the ith object in $$\mathcal{X}$$ corresponds to the jth object in $$\mathcal{Y}$$ for which $$\boldsymbol{\pi}_{ji}=1$$. The correspondence is defined by a cost function $$c(\mathcal{X}, \mathcal{Y}|\boldsymbol{\pi})$$ which is maximized with respect to $$\boldsymbol{\pi}\in\mathcal{P}$$. Here $$\mathcal{P} \subset\{0,1\}^{N \times N}$$ is the set of all N×N permutation matrices, that is, binary matrices with unit row and column sums.

In this work the samples will be represented as real-valued vectors, and hence the sets are represented as matrices $$\mathbf{X}\in \mathbb{R}^{D_{x} \times N}$$ and $$\mathbf{Y}\in\mathbb{R}^{D_{y} \times N}$$. Individual samples will be denoted by column vectors $$\mathbf{x}_i$$ and $$\mathbf{y}_j$$. In general, there does not need to be any correspondence between the feature spaces of the two sets; their dimensionalities can differ, as well as the actual features.

In case we have a distance measure between the two sets, providing distances $$d_{ij}$$ between the samples $$\mathbf{x}_i$$ and $$\mathbf{y}_j$$, the problem is straightforward. It is an instance of bipartite graph matching problems, and can be solved as a linear assignment problem (Kuhn 1955) by minimizing the total distance $$c(\mathcal{X}, \mathcal{Y}|\boldsymbol{\pi}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \boldsymbol{\pi}_{ji}d_{ij}$$. In this work we are interested in scenarios where no such distance is known. Then we need to use alternative costs that, when maximized, result in a good match.
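The known-distance case can be illustrated with a minimal sketch. It assumes SciPy's `linear_sum_assignment` as the linear assignment solver (any Hungarian-type solver would do); the helper name `match_with_distances` and the toy distance matrix are illustrative assumptions, not part of the original work.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_distances(D):
    """Solve the assignment problem for distances D[i, j] = d_ij.

    Returns the permutation matrix pi (pi[j, i] = 1 when x_i is paired
    with y_j, following the convention of the text) and the total distance.
    """
    D = np.asarray(D, dtype=float)
    rows, cols = linear_sum_assignment(D)   # rows index x_i, cols index y_j
    pi = np.zeros(D.shape, dtype=int)
    pi[cols, rows] = 1
    return pi, D[rows, cols].sum()

# Toy distance matrix (made up for illustration only).
D = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])
pi, total = match_with_distances(D)
```

With this convention, `Y @ pi` re-orders the columns of a data matrix `Y` so that column i holds the sample matched to the ith sample of the other set.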

### Bayesian matching

We formulate a solution to the matching problem by straightforward joint modeling. We assume the data is generated by a latent variable model of the type

$$p(\mathbf{X},\mathbf{Y}) = \prod_{i=1}^N \int p( \mathbf{x}_i|\mathbf{z}_i)p( \mathbf{y}_i| \mathbf{z}_i) p(\mathbf{z}_i) d\mathbf{z}_i.$$

That is, the i.i.d. samples in the two data sets are conditionally independent given a latent variable $$\mathbf{z}_i$$.

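The matching is introduced as an explicit permutation matrix $$\boldsymbol{\pi} \in\mathcal{P}$$ applied to re-order the samples in one of the data sets. We indicate by $$\boldsymbol{\pi}_{ji}=1$$ that the sample $$\mathbf{y}_j$$ pairs with the latent variable $$\mathbf{z}_i$$ (which is associated with the sample $$\mathbf{x}_i$$), and using $$\boldsymbol{\pi}_{.i}$$ to denote the ith column we can write the Bayesian matching model as

$$p(\mathbf{X},\mathbf{Y}|\boldsymbol{\pi}) = \prod_{i=1}^N \int p( \mathbf{x}_i|\mathbf{z}_i)p( \mathbf{Y}\boldsymbol{\pi}_{.i}| \mathbf{z}_i) p(\mathbf{z}_i) d\mathbf{z}_i.$$
(1)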

The task in Bayesian matching is then to find the posterior distribution of the permutation, p(π|X,Y), which cannot be done analytically. In this work we will introduce two alternative strategies for approximating it for one particular model. One approach is based on variational approximation of the posterior and the other uses Gibbs sampling to draw samples from the posterior. While any joint model could in principle be used, the inference details will depend on the choice of the model. Next, we will summarize our choice for the underlying model before explaining how it needs to be modified to solve the matching problem.

## Bayesian CCA

As the actual model we use the Bayesian CCA model as presented by Klami et al. (2013), which matches the choice of maximizing correlation made by the earlier solutions of Haghighi et al. (2008) and Tripathi et al. (2011). This means our approach is maximizing both the joint marginal likelihood and a dependency measure. Furthermore, it has the intuitively appealing property that we need not consider variation independent of the other set while learning the match, since CCA uses separate components independent of the permutation for modeling that.

The Bayesian CCA is a fairly simple linear model for two multivariate data sets. Each sample is represented by a latent variable $$\mathbf{z}_i$$ which is linearly transformed to both observation spaces, complemented with additive Gaussian noise. The Bayesian CCA for K components is defined as

$$\begin{aligned} \mathbf{z}_i &\sim\mathcal{N}(\mathbf{0}, \mathbf{I}), \\ [\mathbf{x}_i;\mathbf{y}_i] &\sim\mathcal{N}( \mathbf{W} \mathbf{z}_i,\boldsymbol{\varSigma}), \end{aligned}$$
(2)

where $$[\mathbf{x}_i;\mathbf{y}_i]$$ denotes the feature-wise concatenation of the samples with $$D=D_x+D_y$$ dimensions and $$\mathbf{z}_{i} \in\mathbb{R}^{K \times 1}$$. The basic idea of BCCA is that the latent components model only the correlations. To achieve this, we either need to use a block-diagonal covariance $$\boldsymbol{\varSigma}$$ that allows free correlations between the features within each set (Bach and Jordan 2005; Klami and Kaski 2007), or we can use a diagonal covariance

$$\boldsymbol{\varSigma}= \left [ \begin{array}{c@{\quad}c} \tau_x^{-1} \mathbf{I}& {\bf0} \\ \bf{0} & \tau_y^{-1} \mathbf{I} \end{array} \right ],$$

but need to model the view-specific correlations with additional components (Virtanen et al. 2011; Klami et al. 2013); this equals assuming the view-specific variation can be modeled with low-rank covariance.

Here we adopt the latter choice, which results in a considerably faster and more accurate algorithm for high-dimensional data. In particular, we impose a group-wise sparsity prior on $$\mathbf{W}\in\mathbb{R}^{D \times K}$$ so that some of the K components are used for modeling dependencies between the two sets, whereas some are used for describing variation independent of the other set. By re-writing $$\mathbf{W}=[\mathbf{W}_x;\mathbf{W}_y]$$ so that $$\mathbf{W}_x$$ covers the dimensions spanned by X (and similarly for $$\mathbf{W}_y$$), we want solutions where some of the columns of W become sparse in a very specific sense: either $$\mathbf{W}_x$$ or $$\mathbf{W}_y$$, or both, is completely zero for that component.

Such a solution splits the components into three groups. For some components all elements are free; these are the components that model the correlations between the two sets. For some components the elements in W x are free but the elements in W y are zero; these model variation specific to the data set X. Similarly, the components for which W y is free but W x is zero model the variation specific to Y. Finally, components for which both parts become zero can be dropped out, resulting in automatic complexity selection.
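The three component groups can be made concrete with a small generative sketch of the model in (2) with the diagonal noise covariance. The function name `sample_bcca`, the boolean activity masks `active_x` and `active_y`, and the default precisions are assumptions made for illustration; the projections are drawn from a unit Gaussian rather than from the priors used later for inference.

```python
import numpy as np

def sample_bcca(N, Dx, Dy, active_x, active_y, tau_x=10.0, tau_y=10.0, seed=0):
    """Draw data from a CCA-type model with group-wise sparse projections.

    active_x[k] / active_y[k] say whether component k is used by X / Y:
    both True -> shared (correlation) component, only one True ->
    view-specific component, both False -> component that can be dropped.
    """
    rng = np.random.default_rng(seed)
    active_x = np.asarray(active_x, dtype=bool)
    active_y = np.asarray(active_y, dtype=bool)
    K = active_x.size
    # Whole columns of W_x / W_y are zeroed for inactive components.
    Wx = rng.standard_normal((Dx, K)) * active_x
    Wy = rng.standard_normal((Dy, K)) * active_y
    Z = rng.standard_normal((K, N))                      # z_i ~ N(0, I)
    X = Wx @ Z + rng.standard_normal((Dx, N)) / np.sqrt(tau_x)
    Y = Wy @ Z + rng.standard_normal((Dy, N)) / np.sqrt(tau_y)
    return X, Y, Z, Wx, Wy

# One shared, one X-specific, one Y-specific and one unused component.
X, Y, Z, Wx, Wy = sample_bcca(50, 4, 3, [1, 1, 0, 0], [1, 0, 1, 0])
```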

The model is illustrated in Fig. 1, which shows the plate diagram of the generative model, as well as a graphical illustration of how the group-wise sparsity makes some of the components model low-rank covariance specific to each data set. The figure shows the model as it is used for matching, explained in Sect. 4, but dropping the permutation makes the illustration applicable to regular BCCA as well. Next we will briefly recap two alternative inference algorithms for BCCA following Klami et al. (2013), using two different approaches for achieving the group-wise sparsity structure, before explaining the matching variant.

### Gibbs sampling with spike-and-slab prior

The spike-and-slab is a mixture prior that gives some probability for the value to be exactly zero (the spike) and some probability for it to be drawn from a proper distribution (the slab). We use a group-wise extension of spike-and-slab by drawing for each component two binary latent variables $$h_{xk}$$ and $$h_{yk}$$ that tell whether to draw the values of $$\mathbf{W}_{xk}$$ and $$\mathbf{W}_{yk}$$ from the slab (a Gaussian distribution) or from the spike (delta distribution at zero). Here $$\mathbf{W}_{xk}$$ denotes the $$D_x$$-dimensional vector corresponding to the kth component, the kth column of $$\mathbf{W}_x$$. The h are drawn from Bernoulli distributions, and for $$h_{xk}=1$$ we then draw the elements of $$\mathbf{W}_{xk}$$ independently from a Gaussian distribution whose precision $$\beta_{xk}$$ is sampled from a Gamma prior. For $$h_{xk}=0$$ we simply set all elements of $$\mathbf{W}_{xk}$$ to zero, and the formulas for Y are equivalent except for the subscript.

Klami et al. (2013) derived a Gibbs sampler for BCCA using the above prior, by extending the element-wise sparse factor analysis model by Knowles and Ghahramani (2011). The algorithm is conceptually very straightforward, drawing each parameter from its analytic posterior given all of the other parameters. Since the model uses conjugate priors for all of the parameters, these can be derived easily. The latent variables h, however, are drawn by integrating out W; the resulting posterior is still analytically tractable. The full conditional densities are summarized in the Appendix.

### Variational approximation with ARD prior

An alternative to sampling is to approximate the posterior distribution with a simpler, tractable distribution. The approximation is learned by minimizing the Kullback-Leibler divergence between the approximation and the true posterior. We adopt the variational approximation provided by Klami et al. (2013), which uses a group-wise extension of the automatic relevance determination (ARD) prior for achieving the right kind of sparsity.

The prior is defined as

$$p(\mathbf{W}) = \prod_{k=1}^K \mathcal{N}\bigl(\mathbf{W}_{xk}|\mathbf{0},\beta_{xk}^{-1}\mathbf{I}\bigr) \mathcal{N}\bigl(\mathbf{W}_{yk}|\mathbf{0},\beta_{yk}^{-1}\mathbf{I}\bigr), \qquad \beta_{xk},\beta_{yk} \sim\mathcal{G}(\alpha_{0},\beta_{0}),$$

where $$\mathbf{W}_{xk}$$ again denotes the kth column of $$\mathbf{W}_x$$, and $$\mathcal{G}(\alpha_{0},\beta_{0})$$ is a flat Gamma prior. It achieves the group-wise sparsity by driving the precision $$\beta_{xk}$$ towards infinity for the components that are not needed for modeling the X data set (and similarly for Y). Contrary to the spike-and-slab prior, this does not result in exact zeroes in W. Instead, the unnecessary values will just become very small; this is still sufficient for the model to provide the CCA solution, and it is easier to do variational approximation over a continuous prior.

An efficient variational approximation is provided by the factorization

$$Q=q(\tau_x)q(\tau_y)\prod_{k=1}^K q(\beta_{xk})q(\beta_{yk}) \prod _{i=1}^N q(\mathbf{z}_i) \prod _{d=1}^D q(\mathbf{W}_{d.}),$$
(3)

and the parameters of each term are learned by updating each of the terms in turn. The full updates are given by Klami et al. (2013), and are hence not replicated here.

## Matching Bayesian CCA

The matching Bayesian CCA model extends the above formulations by replacing the likelihood part in (2) with

$$[\mathbf{x}_i;\mathbf{Y}\boldsymbol{\pi}_{.i}] \sim \mathcal {N}(\mathbf{W} \mathbf{z}_i,\boldsymbol{\varSigma}),$$

where $$\boldsymbol{\pi}\in\mathcal{P}$$. We assume a uniform prior over all N×N permutation matrices, but could easily incorporate priors obtained from additional information sources as long as they factorize over the sample pairs. The resulting model is illustrated in Fig. 1.

Including the permutation in the model also requires corresponding changes in the inference procedures. These changes are conceptually very easy. For the sampler we merely include the new parameter in the model, derive a distribution for sampling it given the rest of the parameters, and condition all other sampling formulas on the permutation. For the variational approximation we complement (3) with an extra term $$q(\boldsymbol{\pi})$$, which is a distribution over permutation matrices, and an update rule for that term. Also, we need to change the updates for other terms to integrate over the new term.

In both approaches most of these changes are easy to do. However, the part where we either sample the permutation or update its approximation is non-trivial. We will next present several alternative techniques for achieving these steps, starting with the Gibbs sampler.

## Matching BCCA with Gibbs sampler

Given a specific permutation, the changes needed for the sampling equations of W, h, β and τ are trivial; we merely need to re-order the samples in Y by multiplying it with the current permutation π before drawing the new samples.

The two remaining parameters, the permutation $$\boldsymbol{\pi}$$ and the latent variables $$\mathbf{z}_i$$, depend cyclically on each other, and hence we sample them jointly, based on a slight re-parameterization of the model. The latent variables $$\mathbf{z}_i$$ can be equivalently written as

$$\mathbf{z}_i = \boldsymbol{\mu}_i^x + \boldsymbol{\mu}_j^y + \boldsymbol{\xi}_i$$

where $$\boldsymbol{\mu}_{i}^{x} = \boldsymbol{\varSigma}_{z} \tau_{x} \mathbf{W}_{x}^{T} \mathbf{x}_{i}$$, $$\boldsymbol{\mu}_{j}^{y} = \boldsymbol{\varSigma}_{z} \tau_{y} \mathbf{W}_{y}^{T} \mathbf{y}_{j}$$, and ξ i is zero-mean Gaussian noise with covariance $$\boldsymbol{\varSigma}_{z} = (\tau_{x} \mathbf {W}_{x}^{T}\mathbf{W}_{x} + \tau_{y} \mathbf{W}_{y}^{T}\mathbf{W}_{y} + \mathbf{I})^{-1}$$. In other words, the posterior distribution depends on two deterministic terms, one that depends on the sample x i and the other that depends on the corresponding sample y j in the other set, as well as an independent noise term. Importantly, the random term does not depend on the permutation at all; changing the permutation only influences the second deterministic term.

Using Ξ to denote the collection of all ξ i variables, we can write the conditional distribution p(π,Z|rest) as p(π,Ξ|rest)=p(π|Ξ,rest)p(Ξ|rest), decoupling the two variables. Here “rest” indicates all of the other model parameters, which are not explicitly written for notational simplicity. In principle it would be easy to draw samples from this conditional; the latter term is Gaussian and the former is a discrete distribution for which we can evaluate the log-probabilities as

$$\log p(\boldsymbol{\pi}|\boldsymbol{\varXi}) = - \sum _{i=1}^N \tau_y (\mathbf{y}_j - \mathbf{W}_y \mathbf {z}_i)^T( \mathbf{y}_j - \mathbf{W}_y \mathbf{z}_i) + \mathrm{const},$$

where j is chosen such that $$\boldsymbol{\pi}_{ji}=1$$. However, for directly drawing posterior samples from this discrete distribution we would need to normalize the probabilities with a term that requires computing the probability for all possible permutations. This is infeasible for all but the smallest N, since there are N! different permutations.

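To handle the difficult conditional we propose an approximative Metropolis-Hastings step with a Gibbs-like proposal distribution $$q(\boldsymbol{\pi}^*,\boldsymbol{\varXi}^*)=q(\boldsymbol{\pi}^*|\boldsymbol{\varXi}^*)q(\boldsymbol{\varXi}^*)$$ that does not depend on the current values of $$\boldsymbol{\pi}$$ and $$\boldsymbol{\varXi}$$. For the latter term we use the actual conditional distribution $$q(\boldsymbol{\varXi}^*)=p(\boldsymbol{\varXi}^*|\mathrm{rest})$$, which is Gaussian, whereas for $$q(\boldsymbol{\pi}^*|\boldsymbol{\varXi}^*)$$ we use a simple proposal distribution that is the delta distribution centered on the most likely permutation given $$\boldsymbol{\varXi}^*$$ and the other parameters. We can easily find that permutation by solving a linear assignment problem (LAP) with the cost matrix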

$$\mathbf{A}_{ij} = \tau_y \bigl(\mathbf{y}_j - \mathbf{W}_y \mathbf{z}_i^* \bigr)^T \bigl( \mathbf{y}_j-\mathbf{W}_y \mathbf{z}_i^* \bigr).$$
(4)
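The proposal step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `propose_permutation` is hypothetical, SciPy's `linear_sum_assignment` stands in for the LAP solver, and the candidate latent variable for a pair (i, j) is taken to be $$\boldsymbol{\mu}_i^x + \boldsymbol{\mu}_j^y + \boldsymbol{\xi}_i^*$$, which is one reading of the re-parameterization above. The dense K×N×N array keeps the code short but is only practical for small N.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def propose_permutation(X, Y, Wx, Wy, tau_x, tau_y, rng):
    """Draw Xi* from its Gaussian conditional and return the LAP-optimal pi*."""
    K, N = Wx.shape[1], X.shape[1]
    Sigma_z = np.linalg.inv(tau_x * Wx.T @ Wx + tau_y * Wy.T @ Wy + np.eye(K))
    mu_x = Sigma_z @ (tau_x * Wx.T @ X)      # column i is mu_i^x
    mu_y = Sigma_z @ (tau_y * Wy.T @ Y)      # column j is mu_j^y
    # xi_i ~ N(0, Sigma_z); this draw does not depend on the permutation.
    Xi = np.linalg.cholesky(Sigma_z) @ rng.standard_normal((K, N))
    # Candidate latent variable for every pair (i, j): shape K x N x N.
    Zc = (mu_x + Xi)[:, :, None] + mu_y[:, None, :]
    resid = Y[:, None, :] - np.einsum('dk,kij->dij', Wy, Zc)
    A = tau_y * np.sum(resid ** 2, axis=0)   # cost matrix, A[i, j]
    rows, cols = linear_sum_assignment(A)    # most likely permutation
    pi = np.zeros((N, N), dtype=int)
    pi[cols, rows] = 1                       # pi[j, i] = 1: y_j pairs with z_i
    Z = mu_x + Xi + mu_y @ pi                # latent variables under pi*
    return pi, Z, A
```

In the approximative sampler described in the text every such proposal is accepted, so one call per iteration yields the next joint sample of the permutation and the latent variables.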

The Metropolis-Hastings ratio for acceptance of the proposals is given by

$$\frac{p(\boldsymbol{\pi}^*,\boldsymbol{\varXi}^*)q(\boldsymbol {\pi},\boldsymbol{\varXi})}{p(\boldsymbol{\pi},\boldsymbol{\varXi })q(\boldsymbol{\pi} ^*,\boldsymbol{\varXi}^*)} = \frac{p(\boldsymbol{\pi }^*|\boldsymbol{\varXi}^*)p(\boldsymbol{\varXi}^*)q(\boldsymbol {\pi}|\boldsymbol{\varXi})q(\boldsymbol{\varXi})}{ p(\boldsymbol{\pi}|\boldsymbol{\varXi})p(\boldsymbol{\varXi })q(\boldsymbol{\pi}^*|\boldsymbol{\varXi}^*)q(\boldsymbol{\varXi }^*)} = \frac{p(\boldsymbol{\pi}^*|\boldsymbol{\varXi }^*)}{p(\boldsymbol{\pi}|\boldsymbol{\varXi})}$$

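where $$q(\boldsymbol{\pi}|\boldsymbol{\varXi})=q(\boldsymbol{\pi}^*|\boldsymbol{\varXi}^*)=1$$ and $$q(\boldsymbol{\varXi})=p(\boldsymbol{\varXi})$$ cancel out. Consequently, the proposals should be accepted with probability $$\min(1,p(\boldsymbol{\pi}^*|\boldsymbol{\varXi}^*)/p(\boldsymbol{\pi}|\boldsymbol{\varXi}))$$. Computing the ratio is infeasible due to the normalization constants that are sums over all permutations.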

We propose approximating the ratio with a constant $$p(\boldsymbol{\pi}^*|\boldsymbol{\varXi}^*)/p(\boldsymbol{\pi}|\boldsymbol{\varXi})=1$$, which means accepting all proposals. The sampler will hence not draw samples from the true posterior, but in practice the approximation is fairly accurate due to two properties. First, the relative probability of the most likely permutation is effectively independent of the likelihood of that permutation. This means that the sampler does not impose bias in favor of or against permutations that fit the data well. Instead, it merely has a small bias towards $$\boldsymbol{\varXi}$$ that would give roughly equal probability for several permutations, the kind of latent variable allocations where some variables lie near the Voronoi border of two data points. Second, for high-dimensional data $$p(\boldsymbol{\pi}|\boldsymbol{\varXi})$$ approaches one (and hence also the ratio approaches one), since the best permutation becomes increasingly more likely compared to all others. We provide empirical evidence for these arguments in Sect. 9.1. Using small enough N that makes explicit normalization of $$p(\boldsymbol{\pi}|\boldsymbol{\varXi})$$ feasible and hence permits exact Gibbs steps, we show that the resulting marginal posterior over the permutations is a good approximation for the true marginal posterior. An intuitive explanation is that even though we use a crude approximation for $$q(\boldsymbol{\pi}|\boldsymbol{\varXi})$$, the marginal posterior depends more on $$\boldsymbol{\varXi}$$ and hence the joint samples will still cover roughly the same part of the permutation space. This also makes the chain ergodic.

In case one is not willing to make the above approximation, an exact sampler can be derived by updating only a subset of latent variables at a time. For a subset of sufficiently small size (in practice at most 6 or 7) we can go through all possible permutations and compute the relative probabilities of each of them. We then draw that subset of the permutation matrix by conditioning on the rest of the parameters and the remaining part of the permutation. In practice we suggest drawing a random subset of samples, updating the posterior for those, and repeating this procedure several times to obtain the next posterior sample.
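The subset update can be sketched as below, assuming the pairwise costs have already been collected into a matrix `A` with `A[i, j]` equal to the cost of pairing $$\mathbf{z}_i$$ with $$\mathbf{y}_j$$, so that the unnormalized log-probability of a permutation is the negative sum of the selected entries. The function name `resample_subset` and the index-vector representation `assign` (with `assign[i] = j`) are illustrative assumptions.

```python
import itertools
import numpy as np

def resample_subset(assign, A, subset, rng):
    """Exactly resample the pairing of a small subset, keeping the rest fixed.

    assign[i] = j means that y_j is currently paired with z_i. The targets
    currently held by `subset` are redistributed among the same subset.
    """
    subset = np.asarray(subset)
    targets = assign[subset]
    # Enumerate all m! re-orderings of the subset's targets (m small).
    perms = list(itertools.permutations(range(len(subset))))
    logp = np.array([-A[subset, targets[list(p)]].sum() for p in perms])
    logp -= logp.max()                       # stabilize before exponentiating
    prob = np.exp(logp)
    prob /= prob.sum()                       # explicit normalization
    choice = rng.choice(len(perms), p=prob)
    new_assign = assign.copy()
    new_assign[subset] = targets[list(perms[choice])]
    return new_assign
```

With a subset of size m the enumeration costs m! evaluations, which is why the subset size has to stay small.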

An alternative exact sampler could be derived based on the pseudo-marginal likelihood method by Andrieu and Roberts (2009). They prove that even if the likelihoods in the Metropolis-Hastings acceptance ratio are replaced with estimates, the chain will still produce samples from the correct posterior, assuming the estimation error is independent of the sampling chain. In our case, we could estimate the likelihood $$p(\boldsymbol{\pi}|\boldsymbol{\varXi})$$ by estimating the normalization constant, for example with random walks over the permutations. However, we do not think that the added computational overhead needed to estimate the normalization constant would be worth it, and hence we experiment only with the approximative variant. From this perspective, it corresponds to using a constant estimate $$p(\boldsymbol{\pi}|\boldsymbol{\varXi})\approx c$$, which already seems to have almost independent approximation error.

### Posterior summaries

The samplers described above produce a collection of samples from the posterior distribution $$p(\mathbf{Z},\boldsymbol{\pi},\tau,h,\mathbf{W},\beta|\mathbf{X},\mathbf{Y})$$. In matching problems the primary interest is in the marginal posterior $$p(\boldsymbol{\pi}|\mathbf{X},\mathbf{Y})$$, which is obtained by counting how many times each of the possible permutations was drawn during the process.

Various posterior summaries can be useful for interpretation. The most obvious one is the mean over the permutations $$\tilde{\boldsymbol{\pi}}= \frac{1}{S} \sum_{s=1}^{S} \boldsymbol{\pi}^{(s)}$$, where $$\boldsymbol{\pi}^{(s)}$$ is the sth posterior sample collected after a burn-in period and S is the total number of samples. This mean is not a permutation matrix itself, but instead a doubly-stochastic matrix that can be interpreted as a soft permutation; its entries tell the probability that two samples are paired with each other.

In many applications we are also interested in obtaining a single permutation that summarizes the whole posterior. We create the summary by finding the permutation π that maximizes $$\sum \mathrm {diag}(\tilde{\boldsymbol{\pi}} \boldsymbol{\pi})$$, the total probability of each sample pairing with its chosen pair. The cost is again that of an assignment problem, and hence we obtain the solution by applying LAP to the weight matrix $$\tilde{\boldsymbol{\pi}}$$. This solution is the same that Tripathi et al. (2011) used for finding a consensus of multiple isolated matching tasks.
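Both summaries are easy to compute from the collected samples. The sketch below assumes the samples are stacked in an S×N×N array and uses SciPy's `linear_sum_assignment` with `maximize=True` to pick the permutation whose selected entries of the mean have the largest total; the function name `summarize_permutations` is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def summarize_permutations(samples):
    """Return the posterior mean (soft permutation) and a consensus permutation."""
    samples = np.asarray(samples, dtype=float)      # shape S x N x N
    mean_pi = samples.mean(axis=0)                  # doubly stochastic matrix
    # Consensus: permutation maximizing the summed pairing probabilities.
    rows, cols = linear_sum_assignment(mean_pi, maximize=True)
    consensus = np.zeros(mean_pi.shape, dtype=int)
    consensus[rows, cols] = 1
    return mean_pi, consensus

# Three hypothetical samples: the identity twice and one swap of items 0 and 1.
eye = np.eye(3, dtype=int)
swap = eye[[1, 0, 2]]
mean_pi, consensus = summarize_permutations([eye, eye, swap])
```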

## Matching CCA with variational approximation

The variational approximation for matching CCA was presented in our preliminary work (Klami 2012). Again the basic idea is straightforward; we merely complement the posterior approximation with q(π), which is a distribution over a set of the permutations.

Approximating a distribution over all N! permutations initially sounds infeasible, but in practice the problem is simplified by the same observations that were used to motivate the Gibbs sampler above: (i) we can efficiently find the permutation $$\hat{\boldsymbol{\pi}}$$ that maximizes the variational lower bound and (ii) only a tiny fraction of other permutations have clearly non-zero probability. Given these observations, we can construct $$q(\boldsymbol{\pi})$$ by properly normalizing a distribution over a small set of permutations $$S = \{\boldsymbol{\pi}^{(m)}\}_{m=0}^{M}$$ that includes $$\hat{\boldsymbol{\pi}}=\boldsymbol{\pi}^{(0)}$$ and some nearby permutations.

Given a set S of feasible permutation matrices with their associated weights $$w_m$$, it is easy to compute the expectation $$\langle \boldsymbol{\pi}\rangle= \frac{1}{\sum w_{m}}\sum_{m=0}^{M} w_{m} \boldsymbol{\pi}^{(m)}$$. Simple verification confirms that the update rules of regular Bayesian CCA can be re-used for all other terms in the approximation, assuming we simply transform Y by $$\langle\boldsymbol{\pi}\rangle$$ after updating the match. Hence, the only challenging part is again in updating the permutations.

### The most likely permutation

For the Gibbs sampler we could find the most likely permutation by merely looking at the distances between $$\mathbf{y}_j$$ and $$\mathbf{W}_y\mathbf{z}_i$$. For variational approximation, however, we need to integrate over the approximating distribution. The integrals can be computed analytically since the approximation factorizes over the different terms, but the resulting formulas are a bit more complex than for the sampler.

The log-cost of a permutation is given by

$$\log w_m \propto- \frac{1}{2} \sum _{i=1}^N \bigl\langle\tau_y \bigl( \mathbf{Y} \boldsymbol{\pi}_{.i}^{(m)} - \mathbf{W}_y \mathbf{z}_i \bigr)^T \bigl(\mathbf{Y}\boldsymbol{ \pi}_{.i}^{(m)} - \mathbf{W}_y \mathbf{z}_i \bigr) \bigr\rangle,$$

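where 〈⋅〉 denotes the expectation over the approximating posterior. To find the most likely permutation $$\hat{\boldsymbol{\pi}}$$, we collect all pairwise expected distances into a single N×N matrix A with entries

$$\mathbf{A}_{ij} = \frac{1}{2} \langle\tau_y\rangle \bigl( \mathbf{y}_j^T\mathbf{y}_j - 2 \mathbf{y}_j^T \langle\mathbf{W}_y\rangle \langle\mathbf{z}_i\rangle + \mathrm{Tr}\bigl( \langle\mathbf{W}_y^T\mathbf{W}_y\rangle \langle\mathbf{z}_i\mathbf{z}_i^T\rangle \bigr) \bigr),$$
(5)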

where Tr(⋅) denotes the matrix trace. A regular linear assignment problem solver will then find a globally optimal choice of $$\hat{\boldsymbol{\pi}}$$ so that $$\sum\mathrm {diag}(\mathbf{A}\hat{\boldsymbol{\pi}})$$ is minimized.

Similar to the re-parameterization in the Gibbs sampler, the values $$\langle\mathbf{z}_i\rangle$$ are based on a representation that de-couples the permutation and the stochastic part. This time the relevant expression stems from the update rule for $$q(\mathbf{z}_{i})=\mathcal {N}(\boldsymbol{\mu} _{ij},\boldsymbol{\varSigma}_{z})$$ that can be written as

$$\boldsymbol{\mu}_{ij} = \boldsymbol{\varSigma}_{z} \langle\tau_x\rangle \langle\mathbf{W}_x\rangle^T \mathbf{x}_i + \boldsymbol{\varSigma}_{z} \langle\tau_y\rangle \langle\mathbf{W}_y\rangle^T \mathbf{y}_j.$$

Again the permutation only influences the second deterministic term in the mean $$\boldsymbol{\mu}_{ij}$$. This re-parameterization allows writing the distances in the compact form (5) and also makes transparent an additional factorizing assumption: we assume that the uncertain part $$\boldsymbol{\xi}_i = \mathbf{z}_i - \boldsymbol{\mu}_{ij}$$ (with covariance $$\boldsymbol{\varSigma}_{z}$$) is independent of j. This is a necessary assumption for efficient computation, but means that the approximation is more restricted.

Finally, since CCA automatically infers which of the K components model correlations between the two sets, it makes sense to learn the match only over those components. We can do that by analytically integrating out $$\mathbf{z}_i$$ for the components that do not describe any variation in X (that is, $$\beta_{xk}$$ is very large); the ones not active in Y have no effect on A anyway. Denoting by $$\mathbf{V}_y$$ the columns of $$\mathbf{W}_y$$ corresponding to the components marginalized out and by $$\mathbf{U}_y$$ the remaining columns, the entries in (5) are replaced by $$\frac{1}{2} \langle(\mathbf{y}_{j}-\mathbf{U}_{y}\mathbf {z}_{i})^{T} \boldsymbol{\varPsi}(\mathbf {y}_{j}-\mathbf{U}_{y}\mathbf{z}_{i}) \rangle$$, where $$\boldsymbol{\varPsi}= ( \mathbf{V}_{y} \mathbf{V}_{y}^{T} + \tau_{y}^{-1} \mathbf{I})^{-1}$$. For efficient inversion we use the Woodbury identity $$\boldsymbol {\varPsi}= \tau_{y} \mathbf{I}- \tau_{y}^{2} \mathbf{V}_{y} (\mathbf{I}+ \tau_{y} \mathbf {V}_{y}^{T} \mathbf{V}_{y} )^{-1} \mathbf{V}_{y}^{T}$$. For large $$D_y$$ we can further approximate $$\boldsymbol{\varPsi}=\tau_y\mathbf{I}$$ for faster computation, since the matrix is close to diagonal in many cases.
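The Woodbury form only requires inverting a matrix whose size is the number of marginalized components instead of a $$D_y \times D_y$$ matrix. A minimal numerical sketch (the function name `psi_woodbury` and the random test values are illustrative assumptions):

```python
import numpy as np

def psi_woodbury(Vy, tau_y):
    """Compute (Vy Vy^T + I / tau_y)^{-1} via the Woodbury identity."""
    Dy, Kv = Vy.shape
    # Only a Kv x Kv matrix is inverted, with Kv typically much smaller than Dy.
    inner = np.linalg.inv(np.eye(Kv) + tau_y * Vy.T @ Vy)
    return tau_y * np.eye(Dy) - tau_y ** 2 * Vy @ inner @ Vy.T

# Compare against the direct D_y x D_y inversion on random values.
rng = np.random.default_rng(1)
Vy = rng.standard_normal((7, 2))
tau_y = 3.0
direct = np.linalg.inv(Vy @ Vy.T + np.eye(7) / tau_y)
agrees = np.allclose(psi_woodbury(Vy, tau_y), direct)
```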

### Full posterior over the permutations

To create a full distribution over the permutations we need to complement the most likely permutation with a set of other reasonable permutations. For this purpose we propose two alternative strategies.

### Local perturbations

By making the assumption that $$\boldsymbol{\xi}_i$$ are independent of the permutation $$\boldsymbol{\pi}$$, we can directly use costs of the form $$w_{m} \propto e^{-\sum\mathrm{diag}(\mathbf{A}\boldsymbol{\pi}^{(m)})}$$ to obtain the relative probabilities of different permutations $$\boldsymbol{\pi}^{(m)}$$. We can then approximate the full posterior by repeatedly slightly perturbing $$\hat{\boldsymbol{\pi}}$$ and computing the relative cost of the resulting alternative permutations. This will produce a local unimodal approximation centered around the mode, to complement the other unimodal terms in (3).

To create the set of other feasible permutations, we find N other permutations that differ minimally from the optimal one. We exclude one pair at a time from the optimal match (by setting the corresponding element in A to infinity) and solve the assignment problem again. All of the resulting matches $$\boldsymbol{\pi}^{(m)}$$ will be worse than the optimal one, but will be maximally close in terms of probability due to having only one extra constraint. Note, however, that they will typically differ for multiple pairs, since the single constraint propagates to multiple changes in the full permutation. For each unique $$\boldsymbol{\pi}^{(m)}$$ we then evaluate the associated cost $$w_m$$, and the expectation over the approximative posterior is given by the weighted average $$\langle\boldsymbol{\pi}\rangle= \frac{1}{\sum w_{m}} \sum_{m=0}^{M} w_{m} \boldsymbol{\pi}^{(m)}$$. Here $$\boldsymbol{\pi}^{(0)}=\hat{\boldsymbol{\pi}}$$ denotes the most likely permutation. Note that $$M \leq N$$, since we create N alternative permutations, but it is possible that several constraints result in the same permutation.
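The procedure can be sketched as follows, given a cost matrix `A` with `A[i, j]` the expected distance for pairing $$\mathbf{z}_i$$ with $$\mathbf{y}_j$$. Two choices here are assumptions of the sketch rather than details from the text: SciPy's `linear_sum_assignment` is used as the solver, and the excluded entry is set to a large finite penalty instead of infinity, which has the same effect whenever an alternative assignment exists. The function name `local_perturbation_posterior` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def local_perturbation_posterior(A):
    """Build q(pi) from the optimal match and its N single-exclusion variants."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    idx = np.arange(N)
    _, best = linear_sum_assignment(A)           # best[i] = j for the optimum
    costs = {tuple(best): A[idx, best].sum()}    # keeps unique permutations only
    # Finite stand-in for infinity: large enough that the entry is never chosen.
    penalty = A.max() + N * (A.max() - A.min()) + 1.0
    for i in range(N):
        B = A.copy()
        B[i, best[i]] = penalty                  # exclude one optimal pair
        _, alt = linear_sum_assignment(B)
        costs[tuple(alt)] = A[idx, alt].sum()    # cost under the original A
    perms = list(costs)                          # perms[0] is the optimum
    c = np.array([costs[p] for p in perms])
    w = np.exp(-(c - c.min()))                   # w_m up to a constant factor
    w /= w.sum()
    mean_pi = np.zeros((N, N))
    for p, wm in zip(perms, w):
        mean_pi[np.array(p), idx] += wm          # pi[j, i] = 1 convention
    return perms, w, mean_pi
```

The returned `mean_pi` plays the role of $$\langle\boldsymbol{\pi}\rangle$$ and is doubly stochastic by construction, since it is a convex combination of permutation matrices.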

An interesting observation is that for large $$D_{y}$$ the weight $$w_{m}$$ will be negligible for most $$\boldsymbol{\pi}^{(m)}$$. Even though the algorithm generates maximally similar permutations by always adding just one additional constraint, most of them have very low probability for high-dimensional data. Hence, for high-dimensional data it might be feasible to ignore the alternative permutations altogether and simply use the most likely one. For low-dimensional data the posterior over the permutations is smooth, and no efficient algorithm for finding all likely permutations exists. The above process will, however, generate a reasonable subset of those in the immediate vicinity of the optimal one, providing a local approximation of the posterior around its mode. It is also easy to extend the procedure to create a larger set of likely permutations, for example by also creating alternative permutations with more than one constraint. In the experiments we use the algorithm as described above, creating only the N alternative permutations. Preliminary tests indicate that covering a bigger set of likely permutations slightly improves the accuracy, but for these applications the gain is small compared to the increased computational cost.

### Numerical integration

As explained above, the straightforward variational approximation makes one fairly strong independence assumption: It assumes that $$\boldsymbol{\xi}_{i}$$, the stochastic parts of the latent variables $$\mathbf{z}_{i}$$, are independent of the match. This assumption allows efficient computation of $$A_{ij}$$, but it also implies that the relative probabilities of the different permutations will be incorrect. In particular, the uncertainty in the distances is not fully taken into account; instead, the model strongly favors the closest samples even if the variance of the distance is large.

To avoid the problem we should model the dependencies between the choices instead of allowing each $$\mathbf{y}_{j}$$ to independently integrate over $$\mathbf{z}_{i}$$. If a particular value is chosen for $$\mathbf{z}_{i}$$ then it should simultaneously become, for instance, both a better pair for $$\mathbf{y}_{j}$$ and a worse pair for some other sample $$\mathbf{y}_{l}$$. In other words, the posterior distributions depend on the chosen match: Conditional on the choice “$$\mathbf{y}_{j}$$ is paired with $$\mathbf{x}_{i}$$”, the distribution of $$\mathbf{W}_{y}\mathbf{z}_{i}$$ shifts closer to $$\mathbf{y}_{j}$$, and often this implies it shifts away from some other samples.

Explicitly modeling such dependencies requires simultaneously integrating over the N latent variables $$\mathbf{z}_{i}$$ to compute the relative costs of different permutations. Solving such an integral analytically seems hopeless, and hence we resort to Monte Carlo integration, which closely resembles the way the Gibbs sampler works. Instead of assuming an independent set of $$\boldsymbol{\xi}_{i}$$ values for each sample $$\mathbf{y}_{j}$$ and integrating them out (which would correspond to (5)), we draw a joint random sample $$\boldsymbol{\varXi}^{(m)} = \{\boldsymbol{\xi}_{i}^{(m)}\}_{i=1}^{N}$$ that is independent of j. We then re-compute $$A_{ij}$$ for all pairs, not integrating over $$\mathbf{z}_{i}$$ but instead replacing them with the sampled values (note, however, that we still integrate over $$q(\tau_{y})$$ and $$q(\mathbf{W})$$). We then find the best permutation for this particular sample, and by repeating the process M times we can estimate 〈π〉 as the unweighted average of the resulting permutations.
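A simplified sketch of this Monte Carlo scheme is given below. It is not the full algorithm: point estimates `W_y` and `tau_y` replace the expectations over $$q(\mathbf{W})$$ and $$q(\tau_{y})$$, all latent variables share one hypothetical covariance `z_cov`, and the cost is taken to be the scaled squared distance between $$\mathbf{y}_{j}$$ and $$\mathbf{W}_{y}\mathbf{z}_{i}$$. The function name and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def numint_permutation_average(Y, W_y, z_mean, z_cov, tau_y, M=50, seed=0):
    """Unweighted average of the best permutations for M joint draws of Z."""
    rng = np.random.default_rng(seed)
    N, K = z_mean.shape
    chol = np.linalg.cholesky(z_cov)
    avg = np.zeros((N, N))
    for _ in range(M):
        # One joint draw of all latent variables, shared by every candidate y_j.
        Z = z_mean + rng.normal(size=(N, K)) @ chol.T
        pred = Z @ W_y.T                                   # row i is W_y z_i
        sq = ((pred[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        A = 0.5 * tau_y * sq                               # simplified cost A_ij
        rows, cols = linear_sum_assignment(A)
        avg[rows, cols] += 1.0 / M
    return avg

# Toy data: y_{true_perm[i]} is generated from z_i.
rng = np.random.default_rng(1)
N, K, D = 10, 2, 30
W_y = rng.normal(size=(D, K))
z_mean = rng.normal(size=(N, K))
true_perm = rng.permutation(N)
Y = np.empty((N, D))
Y[true_perm] = z_mean @ W_y.T + 0.1 * rng.normal(size=(N, D))
pi_mean = numint_permutation_average(Y, W_y, z_mean, 0.05 * np.eye(K), tau_y=100.0)
print(pi_mean.round(2))
```

Because every draw yields a valid permutation, the average is a doubly-stochastic matrix whose entries can be read as estimated pairing probabilities.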

It is worth noting that the above numerical integration scheme does not result in a monotonically increasing variational lower bound for the marginal likelihood. We have not found this to be a problem in practice, and as shown by the experiments the resulting match is better than the one obtained when not modeling the dependencies between the latent variables. The gain is particularly clear for high-dimensional data.

### Initialization

As demonstrated by most earlier works, the matching problem is very sensitive to the initialization. Our variational solution is no exception, due to the iterative mean-field algorithm for updating the variational approximation. Hence, we present an initialization scheme that borrows elements from several earlier solutions.

The basic idea is that we solve the problem L times, each time with a different initialization. We then create a consensus of those matches, following the idea by Tripathi et al. (2011), by counting how many times each of the sample pairs was matched together in the set of the L solutions. Finally, the actual solution is computed by initializing the model with the consensus (normalized to probabilities) and solving the matching problem one more time.
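The consensus step amounts to counting co-occurrences of pairs and normalizing. A minimal sketch, assuming each of the L solutions is stored as an index vector whose i-th entry is the sample matched with the i-th sample of the other set (the function name and the toy solutions are hypothetical):

```python
import numpy as np

def consensus_matrix(perms):
    """Fraction of the L solutions in which each pair (i, j) was matched."""
    perms = np.asarray(perms)
    L, N = perms.shape
    counts = np.zeros((N, N))
    for p in perms:
        counts[np.arange(N), p] += 1
    return counts / L

# Three hypothetical solutions for N = 4 samples.
runs = [[0, 1, 2, 3], [0, 1, 3, 2], [0, 2, 1, 3]]
C = consensus_matrix(runs)
print(C)
```

Each row of the result sums to one, so the matrix can be used directly as a soft initial match.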

The reasoning behind the strategy is that by choosing diverse initial models we can better cover the space of potential matches, making the unimodal posterior approximation less of a limitation. However, each individual initialization should still be a good one. To get a set of different but still good initializations, we use a modified version of the PCA-initialization suggested by Quadrianto et al. (2010). For each of the L initializations we compute PCA separately for both sets and order the samples according to the first component. To increase the diversity, we compute the PCA from a random subset of N/2 dimensions instead of the whole data, getting a slightly different initialization for each run. Alternatively, one could add some random noise to the first PCA component before sorting the data points. Finally, we make the initial permutation smoother by convolving it with a Gaussian kernel (the exact width does not seem to matter).
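One possible reading of this initialization is sketched below. Several details are assumptions: the number of dimensions in the random subset is left as a parameter (`n_dims_x`, `n_dims_y`), the first principal component is computed with an SVD of the centered data, and the Gaussian smoothing is applied in rank space with a hypothetical `width`. The sign of a principal component is arbitrary, so the sketch does not resolve whether the two orderings run in the same direction.

```python
import numpy as np

def first_pc_ranks(X, n_dims, rng):
    """Rank of each sample along the first PC of a random subset of dimensions."""
    dims = rng.choice(X.shape[1], size=n_dims, replace=False)
    Xc = X[:, dims] - X[:, dims].mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, 0] * S[0]
    return np.argsort(np.argsort(scores))

def smooth_pca_init(X, Y, n_dims_x, n_dims_y, width=2.0, seed=0):
    """Soft initial match: samples with similar ranks get high probability."""
    rng = np.random.default_rng(seed)
    rx = first_pc_ranks(X, n_dims_x, rng)
    ry = first_pc_ranks(Y, n_dims_y, rng)
    diff = rx[:, None] - ry[None, :]
    P = np.exp(-0.5 * (diff / width) ** 2)   # Gaussian smoothing of the rank match
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 12))                # hypothetical data sets
Y = rng.normal(size=(20, 8))
P0 = smooth_pca_init(X, Y, n_dims_x=6, n_dims_y=4)
print(P0.shape)
```

Calling the function with different seeds gives the slightly different initializations needed for the L runs.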

## Related work

Most earlier solutions to the matching problem maximize statistical dependency between the two sets. The idea is that statistical dependency should not arise by coincidence, but instead a high degree of dependency should be indicative of having found the correct match. A random permutation will make any two sets independent, and hence maximizing the dependency will at least allow escaping that extreme. The practical methods can be divided into two categories based on the dependency measure. The first category optimizes canonical correlation between the sets, whereas the other category maximizes a kernel-based dependency called Hilbert-Schmidt Independence Criterion (HSIC; Smola et al. 2007) or some other kernel-based measure. The former require access to real-valued feature vectors for the two sets, whereas for the latter it is sufficient to provide kernels representing pairwise-distances within each set.

The matching canonical correlation analysis (MCCA) method was introduced by Haghighi et al. (2008) for constructing bilingual dictionaries from monolingual corpora, by matching the individual words in two languages. The basic idea of the algorithm is that it finds a linear subspace that maximizes the correlation between the sets (CCA; canonical correlation analysis). Explicit representation for the subspace allows computing distances between the samples in the two sets, and hence a LAP can be used for finding the match. Since the subspace itself depends on the match, the algorithm alternates between these two steps until convergence. The original algorithm is defined for semi-supervised matching that requires an initial seeding with some known pairs, but Tripathi et al. (2009) presented independently almost the same algorithm for fully unsupervised matching of probes of gene expression platforms, and Tripathi et al. (2010) extended it to use kernel CCA while also presenting the semi-supervised matching problem where some initial seed pairs are given. Later, Tripathi et al. (2011) extended the CCA-based formulation to setups where the task is to find a consensus of multiple matching problem solutions, and Sysi-Aho et al. (2011) applied it to finding correspondence between metabolic profiles of different species. The basic idea is to merge several matching problem solutions by learning one more LAP to find a permutation that best agrees with the initial solutions.

The other category of dependency-maximizing matching solutions builds on the kernelized sorting (KS) idea initially presented by Jebara (2004). Instead of finding an explicit representation that allows computing distances (and hence solving the matching as a LAP), the idea in kernelized sorting is to directly optimize a dependency measure that only depends on kernels computed for the two sets. Quadrianto et al. (2010) introduced the standard KS algorithm that maximizes the Hilbert-Schmidt Independence Criterion. The HSIC is defined as the trace of KL, where K and L are properly centered kernel matrices for the two sets, and KS solves the matching problem by introducing a permutation matrix in that cost. The resulting cost corresponds to a quadratic assignment problem (QAP), which is NP complete (Burkard 1984). Quadrianto et al. (2010) solve the QAP by iteratively applying a LAP solver. Later Jagarlamudi et al. (2010) improved the algorithm by proposing an improved initialization scheme and various other tricks to improve the robustness of KS solutions. They also consider application-specific details for an important domain of matching problems, natural language processing with specific tasks such as document alignment. Yamada and Sugiyama (2011) introduced another variant that replaces the HSIC measure with alternative kernel-based dependency measures, normalized cross-covariance operator and least-squares mutual information, using the same iterative learning algorithm as Quadrianto et al. (2010).
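To make the kernelized sorting cost concrete, the sketch below evaluates the trace of the product of two centered kernel matrices after reordering one of them by a candidate permutation. It is an illustration under assumptions, not the KS algorithm itself: linear kernels are used, the second data set is an exact reordering of the first, and the function names are hypothetical. In this noise-free setting the true permutation attains the maximum of the cost by the Cauchy-Schwarz inequality.

```python
import numpy as np

def centered(K):
    """Center a kernel matrix: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def ks_objective(K, L, perm):
    """Trace of Kc times Lc reordered by perm (perm[i] = match of sample i)."""
    Kc, Lc = centered(K), centered(L)
    return np.trace(Kc @ Lc[np.ix_(perm, perm)])

rng = np.random.default_rng(4)
N, D = 12, 5
X = rng.normal(size=(N, D))
true_perm = rng.permutation(N)
Y = np.empty_like(X)
Y[true_perm] = X                       # second set: the same samples, shuffled

K, L = X @ X.T, Y @ Y.T                # linear kernels
score_true = ks_objective(K, L, true_perm)
score_random = ks_objective(K, L, rng.permutation(N))
print(score_true, score_random)
```

Searching over all permutations for the maximizer of this cost is the quadratic assignment problem mentioned above.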

Recently, Djuric et al. (2012) provided an alternative optimization algorithm for kernelized sorting, called convex kernelized sorting (CKS). Instead of directly optimizing the QAP, they relax the optimization problem by replacing the optimization space of permutation matrices with that of the doubly-stochastic matrices; positive matrices with unit row and column sums. The resulting cost is convex and hence they can find the global optimum of the relaxed cost. However, the solution does not in general correspond to the solution of the original cost function, and even though the algorithm does not need to solve assignment problems it is still computationally very demanding as it performs constrained gradient-based optimization over an $$N^{2}$$-dimensional parameter space. Also, while the model produces soft assignments between the samples, it does not correspond to a proper distribution over the permutations; it is merely the optimal solution over doubly-stochastic matrices. In the empirical sections, we will compare the proposed models primarily with CKS, since Djuric et al. (2012) showed that it typically outperforms other KS algorithms.

As mentioned already in the Introduction, the MCCA method by Haghighi et al. (2008) can also be interpreted as maximizing the joint likelihood of the two data sets, based on the probabilistic interpretation of CCA (Bach and Jordan 2005). In fact, the MCCA model is identical to our formulation (1) except that it does not have priors for any of the model parameters and it uses maximum likelihood estimation. The model by Tripathi et al. (2011) can also be interpreted in similar fashion, since the maximum likelihood solution for probabilistic CCA is equivalent to the classical CCA solution. These two methods are hence the closest alternatives to ours, and in the empirical comparisons we will use them to demonstrate that the improved accuracy of the Bayesian solutions is because of the posterior inference. For running these comparisons we use the optimization algorithm as implemented by Tripathi et al. (2011) since it does not require initial seed pairing, and use the abbreviation CCA-ML to remind that the method corresponds to maximum likelihood estimation of a CCA-based matching method.

Another example of a method maximizing the joint likelihood is the multilingual topic model by Boyd-Graber and Blei (2009). They learn a topic model for two languages by matching their vocabularies based on a maximum a posteriori estimation of a permutation matrix. While their eventual task is learning the topic model itself, the matching solution is an integral part of the model.

Besides the above two criteria (maximal dependency and optimal joint model) for defining the right match, one can solve the matching problem also by explicitly constructing a distance between the two sets. This results in a non-iterative algorithm that merely needs to compute the distances once, since given the distances a single LAP solver will find the match. For example, Tripathi et al. (2011) used the manifold alignment method by Wang and Mahadevan (2009) to compute the distances by aligning local neighborhoods for the two sets in order to solve the matching problem. The application of Wang and Mahadevan (2009) also constitutes a good example of a potential application domain; they seek to match protein structures.

Some related work has also been done on posterior inference over permutations. While these works have not considered the matching problem as such, they are relevant background information. Kondor et al. (2007) considered exact variational inference over permutations by using Fourier transformations, and Plis et al. (2011) mapped the permutations to a high-dimensional hypersphere to do the same. These approaches are, however, only applicable to small N, at most tens, and hence would not be sufficient for our scenarios where N is in the order of hundreds. Leskovec et al. (2010), in turn, proposed a Metropolis sampler for permutations based on swaps of pairs. Their sampler is effectively equivalent to the one Gibbs-subset uses for sampling the posteriors, assuming we use J=2; for larger J we can consider more complex operations than mere swaps of two pairs. They also provide illustrative characterizations of the properties of the permutation space, showing how only a tiny fraction of the permutations have non-negligible probability; this matches exactly our findings.

## Method summary

Since the article describes two different inference algorithms and a number of variants for both, we will here summarize the previous sections by naming the different alternatives. After the summary, we provide the computational complexities for the variants and also for the closest competitors described in the previous section.

1. Gibbs sampling with the most likely permutation (Gibbs-hard): Gibbs sampling for all other parameters except the permutation and the latent variables. The permutations are sampled with an approximative Metropolis-Hastings step that jointly samples π and Z given the rest of the variables. The sampler is approximative since it accepts all proposals as if using Gibbs proposals, despite actually picking the most likely permutation given the latent variables and the rest of the parameters.

2. Gibbs sampling with subset updates (Gibbs-subset): Gibbs sampling is used for all parameters. For sampling the permutation we select a random subset of J samples and draw the permutation corresponding to those samples from the true posterior, enumerating all possible permutations of the elements. In empirical experiments we used J=4 and drew the values for 100 randomly chosen subsets for each posterior sample.

3. VB with the most likely permutation (VB-hard): Variational approximation for all parameters, except the permutation. For the permutation, we simply use the most likely permutation $$\hat{\boldsymbol{\pi}}$$. This variant is equivalent to the comparison method CCA-ML, except that it does posterior inference over the CCA part instead of maximum likelihood.

4. VB with local permutations (VB-local): Variational approximation for all parameters. The most likely permutation is complemented with a set of other feasible permutations $$\{\boldsymbol{\pi}^{(m)}\}$$, obtained by re-optimizing the match with extra constraints that prevent each of the samples in turn from picking its favorite pair. We then compute 〈π〉 as a weighted average of these permutations.

5. VB with numerical integration (VB-numInt): Variational approximation for all parameters. For estimating q(π) we numerically integrate over $$q(\mathbf{z}_{i})$$ to model the dependencies between the match and the latent variables. For each $$\mathbf{Z}^{(m)}$$ drawn from the posterior we solve the assignment problem to obtain a feasible permutation $$\boldsymbol{\pi}^{(m)}$$. The expectation is given by the flat average of such permutations.
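The subset update of Gibbs-subset can be sketched as follows. The sketch assumes that, conditional on everything else, the log-probability of a permutation is the negative sum of the selected entries of a cost matrix `A`, up to a constant; the function name, the random `A`, and the starting permutation are hypothetical.

```python
import itertools
import numpy as np

def gibbs_subset_step(A, perm, J, rng):
    """Resample the matches of J randomly chosen samples, keeping the rest fixed."""
    N = len(perm)
    subset = rng.choice(N, size=J, replace=False)
    targets = perm[subset]                              # partners to redistribute
    options = list(itertools.permutations(targets))     # all J! reassignments
    costs = np.array([A[subset, list(opt)].sum() for opt in options])
    probs = np.exp(-(costs - costs.min()))
    probs /= probs.sum()
    choice = rng.choice(len(options), p=probs)          # exact conditional draw
    new_perm = perm.copy()
    new_perm[subset] = options[choice]
    return new_perm

rng = np.random.default_rng(5)
N = 10
A = rng.random((N, N)) * 5.0                            # hypothetical cost matrix
perm = np.arange(N)
for _ in range(100):                                    # 100 subsets, as in the text
    perm = gibbs_subset_step(A, perm, J=4, rng=rng)
print(perm)
```

With J=4 each step enumerates only 24 candidates, which is why the update is cheap but moves through the permutation space slowly.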

### Computational cost

The computational complexity for one iteration of Gibbs-hard, VB-numInt and VB-hard is $$\mathcal{O} (N^{3} + N^{2} D K + N D K^{3})$$, where $$D=\max(D_{1},D_{2})$$. The first term is because of solving the LAP, the second for computing the costs of all possible pairs, and the last is for updating the parameters of the BCCA model. Here K is typically small compared to N and D. For VB-local the first term becomes $$\mathcal{O} (N^{4})$$, since it needs to solve the LAP N times for each iteration. Gibbs-subset does not require solving a LAP but instead enumerates all permutations of size J, and hence the complexity is $$\mathcal{O} (J! + N^{2} D K + N D K^{3})$$.

Even though the computational complexity is the same for most of the methods, the practical running times go up for the more accurate approximations. VB-numInt needs to solve the LAP M times, and hence takes roughly M times longer than VB-hard since the LAP step dominates the cost for all but very small N. To somewhat reduce the computational load, we update the match only after every 10 iterations; the permutations are fairly stable in any case. One iteration for Gibbs-hard is roughly as efficient as one iteration of VB-hard, but one typically needs to run the sampler for much longer than the VB algorithm, which often converges in tens of iterations. For the practical experiments we used 1,000 samples, making Gibbs-hard roughly as fast as VB-numInt and VB-local, which require fewer iterations but solve the LAP several times per iteration.

The computational complexity of the proposed methods is comparable to that of all of the competing methods. With the exception of CKS, all of the kernelized sorting methods and the CCA-based methods require repeatedly solving a LAP, which is the most time-consuming part in typical applications. Hence, each iteration takes the same amount of time as one iteration of our algorithms and the practical computation time depends on the number of iterations required for convergence. In practice, all methods are applicable to problems of similar magnitude, at least for hundreds of samples and possibly thousands with clever implementation. However, it is worth noting that for solving MCCA and CCA-ML one needs to invert a covariance matrix, introducing an additional complexity term of $$\mathcal{O} (D^{3})$$; our models avoid this by modeling the correlations with explicit components.

CKS does not need to solve LAPs, but instead performs gradient-based optimization over an $$N^{2}$$-dimensional parameter space. There is no easy way to quantify the number of iterations needed for convergence, but computing the gradient as described by Djuric et al. (2012) is $$\mathcal{O} (N^{4})$$ and hence at least for large N the iterations become slower than solving a LAP and the method is not applicable to as large problems as the competing methods. With the increased computational cost comes the advantage of a guaranteed global optimum. In Sect. 9.4 we demonstrate how this advantage can be borrowed for the proposed methods by initializing the algorithms with the results of CKS.

## Experiments

We start with an artificial data experiment, used to demonstrate the characteristics of the solution. In particular, we will show how the two alternative inference strategies have very different strengths and weaknesses. We also demonstrate empirically the quality of the approximation for the Gibbs-hard sampler.

After the demonstration, we compare the proposed methods with earlier matching problem solutions, using data collections analyzed by the earlier authors. We perform three different comparisons. The first is an image matching task from Quadrianto et al. (2010), the second a metabolite matching task from Tripathi et al. (2011), and the last a cross-lingual document alignment task from Djuric et al. (2012). In all cases we compare the proposed methods with the leading kernelized sorting variant CKS by Djuric et al. (2012) and the CCA-ML method by Tripathi et al. (2011), which corresponds to finding the maximum likelihood solution of our model. The purpose of these comparisons is to show that the proposed solution is more accurate than the earlier solutions, while also demonstrating that the improvement comes from the Bayesian treatment of the model.

### Artificial data

In this section we will apply the model on simple artificial data sets of varying dimensionality, to illustrate an important property of the inference strategies. We generated data sets with N=40 samples and $$D_{x}$$ and $$D_{y}$$ ranging from 10 to 640, by sampling data from the model (2) that has four latent variables. We then applied all model variants to these matching problems, initializing them with a permutation that has 50 % correct matches to simulate a reasonably good starting point. The resulting accuracies, averaged over 20 different data sets of each size, are shown in Fig. 2 (top), displaying an interesting trend: for low dimensionality the VB algorithms are the best, but for high dimensionality Gibbs-hard is clearly superior.

The reason for this is in the shape of the posterior distribution over the permutations: for high-dimensional data it is peaked around the best permutation, whereas for low-dimensional data it is extremely wide; nearly all permutations are possible. This is illustrated in Fig. 2 (bottom), which shows for each of the dimensionalities the probability of the most likely permutation, computed for data with N=8; for such a small set we can explicitly enumerate all of the 40,320 possible permutations and hence can compute the actual normalized probability. We see that when the dimensionality grows, the probability assigned to the most likely permutation gets larger. For low dimensionality, the posterior is very flat, but already for D=120 the posterior is so peaked that the approximations of Gibbs-hard, VB-numInt and VB-local become accurate.
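The exhaustive computation for N=8 is small enough to write down directly. The sketch below assumes that the unnormalized log-probability of a permutation is the negative sum of the selected entries of a cost matrix `A`; the random `A` and the function name are hypothetical, so the printed probability says nothing about the actual experiment.

```python
import itertools
import numpy as np

def mode_probability(A):
    """Normalized probability of the best permutation, by full enumeration."""
    N = A.shape[0]
    perms = np.array(list(itertools.permutations(range(N))))   # shape (N!, N)
    costs = A[np.arange(N), perms].sum(axis=1)                 # cost of each permutation
    rel = np.exp(-(costs - costs.min()))                       # weights relative to the mode
    return 1.0 / rel.sum(), len(perms)

rng = np.random.default_rng(6)
A = rng.random((8, 8)) * 10.0                                  # hypothetical cost matrix
p_mode, n_perms = mode_probability(A)
print(n_perms, p_mode)
```

For N=8 the enumeration covers 8! = 40,320 permutations; the same approach becomes infeasible quickly, since the count grows factorially with N.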

Next, we will explain why the variational approximation and the Gibbs sampler behave very differently for these two scenarios. Let us consider the high-dimensional case first. Gibbs-hard does well because the assumption that p(π|Ξ,rest) corresponds to the most likely permutation is good, yet the sampler still explores the space of permutations effectively because Ξ is resampled every time. Gibbs-subset shows a similar trend of improved accuracy for higher dimensions, but it is considerably less efficient in exploring the posterior since it does not find the best permutation for each sample but instead produces permutations with a high degree of autocorrelation. This also results in clearly lower accuracy. The variational approximation, on the other hand, is a mean-field algorithm that explicitly averages over the permutations and the latent variables. Hence, for a peaked posterior VB-local and VB-hard quickly get stuck with one permutation and the model converges to a local optimum. The VB-numInt model does better because it borrows the strength of the Gibbs sampler; it numerically integrates over Ξ when updating the permutation, and hence can explore the space of permutations to some degree.

For the low-dimensional case the true posterior is very wide. Gibbs-hard no longer approximates the posterior well, but even ignoring this issue we have a more fundamental problem with both samplers: Since the posterior over permutations is so wide, it becomes very difficult to estimate the rest of the parameters well. When the sampler, correctly, changes the permutation dramatically from one sample to another, it becomes nearly impossible for W and the other parameters to converge towards a reasonable posterior. The variational approximation, however, is inefficient in exploring this wide posterior since it averages over the possible values. Hence, the inference technique acts as a strong regularizer, making it possible to infer the rest of the parameters even though the true posterior over the matches is very wide.

Finally, we illustrate empirically the approximation error caused by the incorrect acceptance probability for the (π,Ξ) proposals in Gibbs-hard. Using a data set with N=8 samples we ran both an exact Gibbs sampler and the proposed algorithm for 10,000 independent samples and estimated the marginal posterior p(π|rest) based on the posterior samples, keeping the rest of the parameters except π and Ξ fixed. Figure 3 cross-plots the log-probabilities of the two distributions for various data dimensionalities. These plots suggest that despite making a seemingly crude approximation, the Gibbs-hard sampler still produces samples from almost the correct distribution, especially for high-dimensional data. It gives somewhat too high probability for the most likely permutation, but it still gives non-zero probability mass for almost all of the same permutations as the exact sampler, and it also retains the relative probabilities of the permutations accurately.

### Image matching

In this problem the task is to match two halves of a set of 320 images, using the raw pixel values (40×40 pixels in Lab color space) as the input. The problem itself is completely artificial, but it has nevertheless become a kind of benchmark for the matching solutions due to the data provided by Quadrianto et al. (2010). The data has 2400 dimensions, and hence constitutes an example of high-dimensional data for which the sampling algorithms should do well. VB-local, on the other hand, would not notably differ from VB-hard since the posterior is so peaked around the best permutation, and hence we leave it out of the comparison.

We solve the matching problem with varying subsets of the data. For VB-hard and VB-numInt, we learn L=50 different initial models for each choice of N and initialize the final model by the consensus of these matches, using K=8 components to keep the computational cost manageable. We then initialize the Gibbs variants with the result of VB-numInt, and use K=16 components. We ran the samplers for 10 parallel chains, for 500 samples each, and then found the consensus of all posterior samples; since the initialization was already a good one we did not leave a burn-in period out. For Gibbs-subset we used J=4 and 100 subset choices for each posterior sample.

Figure 4 compares the proposed methods with the p-smooth variant of kernelized sorting by Jagarlamudi et al. (2010), convex kernelized sorting by Djuric et al. (2012), and least-squares object matching (LSOM) by Yamada and Sugiyama (2011), all of which have been demonstrated to be superior to the original kernelized sorting algorithm by Quadrianto et al. (2010). In addition, we compare the proposed methods with CCA-ML, which corresponds to using a (hard) EM algorithm to find the maximum likelihood solution of our model. For CCA-ML we used an initialization strategy similar to what was used for the proposed methods. That is, we ran the model L=50 times with different initializations that were slightly randomly permuted PCA-initializations. The final accuracy is the accuracy of the consensus; we also tried running the model one more time using the consensus as initialization but it typically decreased the accuracy. To avoid overfitting to the high-dimensional data, the CCA-ML method was run on the first N/8 PCA components of each data set.

The main finding is that Gibbs-hard is the best matching solution for this data, followed by LSOM. For the whole collection with N=320 images Gibbs-hard gets 275 correct matches compared to 136 for p-smooth and 206 for CKS (see footnote 1); Yamada and Sugiyama (2011) do not report an exact number for LSOM, but extrapolation suggests it would find roughly 245 correct pairs. The variational Bayesian inference is also good as long as we use numerical integration for estimating q(π); it reaches accuracy comparable to CKS while outperforming p-smooth clearly for large sample sizes. The initialization scheme is necessary for achieving this; the individual runs used for finding the initialization only found on average less than 30 correct pairs for N=320, whereas the final run initialized with their consensus reached 201.

Gibbs-subset, which was initialized with the output of VB-numInt, produces effectively the same results as its initialization; this confirms that the sampler is too inefficient in exploring the permutation space. Considerably more samples would be required to improve the results, but since Gibbs-hard works so much better we did not spend excess computational time to do this. We also see that the VB-hard variant that only uses the most likely solution is not sufficient here. This reveals that the good accuracy of Gibbs-hard and VB-numInt is because of the posterior inference over the permutations. However, for large N VB-hard still outperforms CCA-ML, demonstrating that Bayesian inference over the rest of the parameters already helps.

One of the advantages of the Bayesian matching solutions is that in addition to learning the best permutation we can characterize the posterior over the permutations. Convex kernelized sorting can also achieve this to some degree, since it optimizes the HSIC over doubly-stochastic matrices and hence produces soft assignments as a result. Next, we will compare how well the two methods fare in terms of such soft assignments. First we look at recall of the correct pairs, by ordering for each sample $$\mathbf{x}_{i}$$ the samples in Y according to the posterior probability of matching with $$\mathbf{x}_{i}$$. Figure 5 (left) shows that 95 % of the true pairs are already captured within the top 5 ranks. For comparison, Djuric et al. (2012) report 81 % for the same threshold. We also inspected the actual probabilities to verify that the posterior is a reasonable distribution, and that they are consistent with the actual results. Figure 5 (right) plots the probabilities of the correct matches against the highest probabilities assigned for any pair. We see that the probabilities cover the whole range from roughly 0.1 to one, indicating that the algorithm is more certain of some pairs. We also note that it makes very few mistakes for the pairs to which it assigned a high probability, indicating that the values indeed correspond to reasonable probabilities. For comparison, CKS does not assign a weight higher than 0.2 for any pair, illustrating how the soft match learned by CKS cannot be interpreted as any kind of probabilities even though they do sum up to one for each sample; the distributions are clearly too wide to represent the true uncertainty.
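The recall measure used here can be computed from any soft assignment matrix. A minimal sketch, assuming a matrix `P` whose entry (i, j) is the estimated probability that sample i is matched with sample j and an index vector `true_match` of correct partners (the function name and the toy numbers are hypothetical):

```python
import numpy as np

def recall_at_k(P, true_match, k):
    """Fraction of samples whose true partner is among the k most probable ones."""
    top_k = np.argsort(-P, axis=1)[:, :k]            # k highest-probability candidates
    hits = (top_k == np.asarray(true_match)[:, None]).any(axis=1)
    return hits.mean()

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.5, 0.4, 0.1]])
true_match = [0, 2, 1]
r1 = recall_at_k(P, true_match, k=1)
r2 = recall_at_k(P, true_match, k=2)
print(r1, r2)
```

In this toy example two of the three true partners are ranked first and all three are within the top two, so the recall is 2/3 at k=1 and 1 at k=2.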

### Metabolite matching

Next we proceed to an example data set from translational medicine, taken from Tripathi et al. (2011), where the task is to match metabolites of two populations. The problem mimics a challenge where we need to align metabolites of two different species (Sysi-Aho et al. 2011), but here the two populations are both human to provide the ground-truth alignment. The data consists of time series of concentrations of N=53 metabolites, and we have measurements for several subjects. We compare our method with two methods presented by Tripathi et al. (2011), using a setup very similar to theirs. In particular, we average the matching accuracies over 100 runs where X and Y are taken from random subjects (that is, the runs are truly independent since the input data is different in each run), and we restrict the matchings so that a metabolite can only pair with another one in the same functional class (the classes are assumed known). We also provide another set of results without constraints, to demonstrate how well we can do without any prior information on the match.

The individual time series are of very low dimensionality, ranging from 3 to 30 depending on the subject. Hence, we only apply the variational approximation methods for this problem; the posterior over the permutations is so wide that the Gibbs-sampler variants would not work at all. We then compare our method with CCA-ML and CKS.

Figure 6 shows that we again outperform the earlier methods. VB-numInt and VB-local have comparable accuracy, and both are better than CKS and CCA-ML. The comparison with CCA-ML, which corresponds to the maximum likelihood solution of the proposed model, confirms the findings of the image matching experiment (Sect. 9.2). The Bayesian solution is advantageous in two respects. First, the difference between VB-hard and CCA-ML comes solely from doing Bayesian inference over the CCA parameters, since these two models treat the permutations in identical fashion. More importantly, however, the difference between VB-hard and the other two variants reveals that even approximate Bayesian inference over the permutations improves the accuracy dramatically. For completeness, we also tried the Gibbs samplers for this task, but as expected they did not work; they result in posteriors that are only marginally better than random assignments.

Note that in this experiment we did not use the advanced initialization strategy of learning the final model given a consensus of preliminary runs, but instead only used one initialization (based on the first PCA component) for each run. However, we did one final test to mimic the consensus matching setup of Tripathi et al. (2011), and found the consensus of the 100 runs with different input matrices to reach an accuracy of 85 %, compared to their result of 70 % with an equal amount of data and some additional biological constraints not used in our solution.

### Document alignment

As a third real data experiment we consider the task of document alignment. Given two collections of documents written in two different languages, the task is to find the translations by matching the documents. We use the data provided by Djuric et al. (2012), consisting of more than 300 documents extracted from the Europarl corpus and represented as TF-IDF vectors of words stemmed with Snowball. Djuric et al. (2012) considered nine different matching tasks, each between English documents and documents written in one of nine other languages, and showed that CKS outperforms other kernelized sorting algorithms (the original KS algorithm, KS p-smooth and LSOM) for all tasks by a wide margin. They also achieved effectively perfect accuracy for seven of the tasks, reaching at least 98 % accuracy for each. For the remaining two language pairs, English-Swedish and English-Finnish, their accuracy was only 29 % and 37 %, respectively.

We initialized the Bayesian matching solutions with the permutation learned by CKS and then applied VB-numInt and Gibbs-hard for solving the same matching tasks, using a data representation that kept the 10,000 words with the highest total TF-IDF weight over the corpus, separately for each language. For both methods we used K=16, and for Gibbs-hard we again ran 10 separate chains for 500 samples each. We also applied CCA-ML with the same initialization, using the first $D_x = D_y = 50$ PCA components for representing the data. The results are summarized in Table 1, showing how the Bayesian matching solutions and CCA-ML retain the good accuracy for the language pairs CKS already solved adequately. For the two difficult language pairs all methods improve on the initialization, but Gibbs-hard is the only one that also solves those problems perfectly, reaching 100 % accuracy.
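As a rough illustration of how permutation samples from several chains can be summarized, the sketch below pools the samples and estimates the marginal probability of each pair as its relative frequency. The function `match_marginals` and the toy chains are hypothetical, and pooling all samples without discarding any initial ones is a simplifying assumption of this example.

```python
def match_marginals(chains):
    """Estimate P(x_i matched with y_j) as the frequency of the pair in the pooled samples.

    chains is a list of chains; each chain is a list of sampled permutations,
    where perm[i] = j means that x_i is paired with y_j.
    """
    samples = [perm for chain in chains for perm in chain]
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for perm in samples:
        for i, j in enumerate(perm):
            counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]


# Two toy chains with two samples each (the experiment used 10 chains of 500 samples).
chains = [
    [(0, 1, 2), (0, 2, 1)],
    [(0, 1, 2), (1, 0, 2)],
]
marginals = match_marginals(chains)
```

Because every sample is a permutation, the estimated matrix is doubly stochastic by construction, and its rows can be read directly as per-sample match distributions.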

### Summary of the empirical experiments

Above we performed four separate experiments to evaluate the Bayesian matching solutions. Based on both the artificial and real matching experiments we can draw the following conclusions:

• The proposed Bayesian matching solution outperforms the comparison methods, including kernelized sorting variants and earlier methods based on CCA. In particular, it is considerably more accurate than the maximum-likelihood solutions based on the same idea of introducing a permutation matrix as part of CCA (Haghighi et al. 2008; Tripathi et al. 2011). This confirms that the improved accuracy is due to the full posterior inference rather than the model structure or cost function.

• For high-dimensional data Gibbs-hard is the best method. It can explore the posterior space more efficiently than the variational approximation, and it produces interpretable posterior estimates with high matching accuracy. While the conditional density used for sampling the permutation is not necessarily exact, the choice of always picking the best permutation is extremely efficient compared to more justified alternatives. As illustrated in Fig. 3, it is still very accurate in producing samples from the correct posterior.

• For low-dimensional data the true posterior over the permutations is so wide that properly modeling it does not produce good results. Hence, the Gibbs samplers do not work for such data. The variational approximations still provide accurate matches due to the inherent regularization effect of mean-field approximation.

• All of the proposed methods depend heavily on the initialization. A good initialization can be obtained by finding a consensus of several matches. Alternatively, the methods can be initialized by the result of the convex kernelized sorting method by Djuric et al. (2012); it finds the global optimum of a relaxation of the kernelized sorting problem and produces good matching accuracy.

The practical suggestion based on these observations is to use the Gibbs-hard method for learning the matching solutions, assuming the data dimensionality is sufficiently high (at least tens, preferably hundreds or more). The method should be initialized either with a consensus learned from multiple random initializations, or with the CKS method. The consensus is best learned with the VB variants, since the samplers might have difficulties with initial solutions where almost all pairs are incorrect; then the posterior is wide irrespective of the dimensionality since most BCCA components do not describe relationships between the two sets. For low-dimensional data, we suggest using the VB-numInt method instead of the samplers.

## Conclusion

We introduced a variational Bayesian solution for the object matching problem introduced by Jebara (2004) and popularized by Haghighi et al. (2008), Quadrianto et al. (2010), Tripathi et al. (2011) and Yamada and Sugiyama (2011) for solving alignment tasks for example in natural language processing and computational biology. By learning together a Bayesian canonical correlation analysis model (Klami et al. 2013) and a permutation matrix re-ordering the samples in one of the sets, we obtained matching accuracies better than those of any earlier solution.

We presented two alternative inference strategies, one based on approximate Gibbs sampling and the other on variational approximation, and derived the computational details necessary for approximating the posterior over the permutations for both. The resulting algorithms were applied to three benchmark data sets and further illustrated on artificially generated data, to confirm that the proposed algorithms produce accurate matches. In particular, we outperformed all the earlier variants by a comfortable margin. For image matching we improved from 64 % to 85 %, for metabolite alignment we improved from 35 % to 39 %, and in two document alignment tasks we improved from 29–37 % to 100 %. These improvements correspond to real practical gains, and in particular the last one represents a qualitative change where the new method is able to perfectly solve a problem for which the earlier solutions were not satisfactory.

The Gibbs sampler was found to be the better of the two inference solutions, since it can more effectively explore the posterior space. We additionally showed that for sufficiently low-dimensional data the true posterior is so wide that it is actually better to concentrate on some local region of the posterior space. For such setups the Gibbs sampler reduces to almost random guessing, and the variational inference is the best matching solution.