
1 Introduction

One of the fundamental challenges in mass-spectrometry-based proteomics is to identify the proteins present in a cell culture by searching their mass spectrometry fingerprints against all the peptides in a reference proteome database. As the number of mass spectra and the size of the reference proteome increase, this search becomes very slow, especially when post-translational modifications are allowed.

Given a peptide sequence, existing methods construct a binary-valued spectrum from the peptide, with ones at the positions where peaks are present and zeros otherwise. A probabilistic model is then trained to learn the joint probability distribution P(spec, pep) between the predicted spectra and the discretized mass spectra (Fig. 1) [10]. Given a spectrum spec and a set of peptides \(Pep = \left\{ pep_{1}, pep_{2}, \ldots , pep_{N} \right\} \), the goal is to find the peptide(s) \(pep \in Pep\) that maximize P(spec|pep). This task motivates the following problem.
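As an illustration of the discretization step, the following minimal sketch bins a list of peak masses into such a binary vector; the bin width and maximum mass used here are arbitrary choices made for the example, not values prescribed by the methods above.

```python
import numpy as np

def binarize_spectrum(peak_masses, max_mass=2000.0, bin_width=1.0):
    """Discretize a list of peak masses (in Da) into a binary vector.

    A position is 1 if any peak falls into the corresponding mass bin,
    0 otherwise. max_mass and bin_width are illustrative choices.
    """
    n_bins = int(np.ceil(max_mass / bin_width))
    spectrum = np.zeros(n_bins, dtype=np.int8)
    for mass in peak_masses:
        idx = int(mass // bin_width)
        if 0 <= idx < n_bins:
            spectrum[idx] = 1
    return spectrum

# Example with a few hypothetical peak masses
print(binarize_spectrum([147.1, 278.2, 407.3]).sum())  # -> 3
```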

Fig. 1.

Predicted spectra of three peptides FAG, KLT, and AMR are shown, with their mass spectra shown at the bottom. The joint probability distribution between the predicted and mass spectra of the peptides can be learned.

Database Search Problem In Probabilistic Settings (DPPS). Consider the following maximum likelihood problem, which generalizes the problem of matching peptides to spectra. Let \(\mathcal {A} = \left\{ a_{1}, a_{2}, \ldots , a_{m} \right\} \) and \(\mathcal {B}= \left\{ b_{1}, b_{2}, \ldots , b_{n} \right\} \) be discrete alphabets where \(m, n \in \mathbb {N}\). Let \(\mathcal {P}\) be a joint distribution on the alphabets \(\mathcal {A}\) and \(\mathcal {B}\) such that \(\sum _{i=1}^{m}\sum _{j=1}^{n}\mathcal {P}(a_{i},b_{j}) = 1\). Let \(S \in \mathbb {N}\) and \(\mathbb {P}(X,Y) = \prod _{s=1}^{S}\mathcal {P}(x_{s},y_{s}) \) where \(X = (x_{1},x_{2}, \ldots , x_{S})\), \(Y = (y_{1},y_{2}, \ldots , y_{S})\) and \(x_{s} \in \mathcal {A}\), \(y_{s} \in \mathcal {B}\) for \(1 \le s \le S\). Given a data point \(Y \in \mathcal {B}^{S}\) and a set of classes \(\left\{ X^{1}, \cdots , X^{N} \right\} \subseteq \mathcal {A}^{S}\), our goal is to accurately and efficiently predict the class \(X^{t}\), \(1 \le t \le N\), that generated Y.

Note that, in reference to mass-spectrometry-based proteomics, the classes \(\{X^{1},\cdots ,X^{N}\}\) model the set of peptides \(Pep = \left\{ pep_{1}, pep_{2}, \ldots , pep_{N} \right\} \) and Y models a spectrum spec. We address the DPPS problem by solving the following optimization problem:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{{X \in \left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} }} \mathbb {P}(Y|X) \end{aligned}$$
(1)

A naive way to solve this optimization problem is to compute \(\mathbb {P}(Y|X^{t})\) for each \(1\le t \le N\), and find the maximum among them. The complexity of this approach grows linearly with N, and thus is prohibitively slow for practical applications as N grows large.
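For concreteness, a minimal sketch of this naive approach is shown below, with the alphabets represented as integer codes and the conditional probabilities stored in a small table; all specific values are toy placeholders.

```python
import numpy as np

def brute_force_dpps(Y, classes, cond):
    """Naive solution to (1): score every class X against the query Y.

    Y       : length-S array of symbols from alphabet B (integer-coded)
    classes : list of length-S arrays of symbols from alphabet A (integer-coded)
    cond    : m x n array with cond[a, b] = P(b | a)

    Runs in O(N * S) time, i.e. linear in the number of classes N.
    """
    log_cond = np.log(cond)
    scores = [log_cond[X, Y].sum() for X in classes]  # sum_s log P(y_s | x_s)
    return int(np.argmax(scores))

# Toy usage with m = n = 2 and S = 4
cond = np.array([[0.9, 0.1], [0.2, 0.8]])
classes = [np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])]
Y = np.array([0, 0, 1, 1])
print(brute_force_dpps(Y, classes, cond))  # -> 0
```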

2 Related Work

As stated above, the issue with the naive way of solving DPPS is that the brute force calculation of \(\mathbb {P}(Y|X)\) for every \(X \in \{X^{1},\cdots ,X^{N}\}\) is slow. Another domain where a naive brute force calculation is prohibitively slow is nearest neighbor search. In nearest neighbor search, there is a data point \(Y \in \mathbb {R}^{S}\) and a set of points \(\left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} \subseteq \mathbb {R}^{S}\). The goal is to quickly solve the following minimization problem:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{{X \in \left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} }} \Vert Y-X\Vert _{2} \end{aligned}$$

where \(\Vert Y-X\Vert _{2}\) stands for the Euclidean norm. This problem is equivalent to (1) in the special case where the probability distribution \(\mathbb {P}\) is continuous and \(P(y_{s}|x_{s}) \sim \mathcal {N}(x_{s}, \sigma )\) , \(1 \le s \le S\).
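To see this, take logarithms; the normalization constants of the Gaussians do not depend on X and therefore do not affect the maximizer:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{X} \prod _{s=1}^{S} \mathcal {N}(y_{s}; x_{s}, \sigma ) = \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{X} \sum _{s=1}^{S} -\frac{(y_{s}-x_{s})^{2}}{2\sigma ^{2}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{X} \Vert Y-X\Vert _{2} \end{aligned}$$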

For high-dimensional data, the complexity of exact nearest neighbor search grows linearly with the number of data points [9]. Therefore, researchers consider the approximate nearest neighbor (ANN) search problem. In the ANN-search problem, the objective is to find \(X \in \left\{ X^{1}, X^{2}, \ldots X^{N} \right\} \) such that

$$\begin{aligned} \Vert Y-X\Vert _{2} \le c \Vert Y-X^{*}\Vert _{2} \end{aligned}$$
(2)

where

$$\begin{aligned} X^{*} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{{X \in \left\{ X^{1}, X^{2}, \ldots X^{N} \right\} }} \Vert Y-X\Vert _{2} \end{aligned}$$
(3)

and \(c > 1\) is referred to as the “approximation factor”. A common algorithm for solving this problem is locality sensitive hashing [2, 6, 7, 9] (Algorithm 1). This algorithm takes as input hashes h that satisfy the following constraints for some \(R > 0\), and \(0 < P_{2} \le P_{1} \le 1\):

$$\begin{aligned}&\bullet \ \text {If } \Vert Y-X\Vert _{2}\le R, \text { then } h(X) = h(Y) \text { with probability at least } P_{1}\end{aligned}$$
(4)
$$\begin{aligned}&\bullet \ \text {If } \Vert Y-X\Vert _{2} \ge cR, \text { then } h(X) = h(Y) \text { with probability at most } P_{2} \end{aligned}$$
(5)

where hashes h satisfying (4) and (5) are called \((R,cR,P_{1},P_{2})\) - sensitive hashes. As stated in Gionis et al. [7], “The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point.”

Algorithm 1

The locality sensitive hashing algorithm takes a query point Y and aims to find the point most similar to it in a database. The algorithm does this by first applying r hash functions to the query point in each band j, \(1 \le j \le b\). Then, in each band, the algorithm considers all points X in the database that have been hashed to the same values as Y in all of the r hash functions.
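A minimal sketch of this banded scheme is shown below, using random-hyperplane (sign) hashes as a stand-in for whatever \((R,cR,P_{1},P_{2})\)-sensitive family is appropriate; the hash family, parameters, and helper names are illustrative rather than those analyzed in [7].

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(points, b=10, r=4, seed=0):
    """Index `points` (N x S array) with b bands of r sign hashes each."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(b, r, points.shape[1]))   # random hyperplanes
    tables = [defaultdict(list) for _ in range(b)]
    for idx, x in enumerate(points):
        for j in range(b):
            key = tuple((planes[j] @ x > 0).astype(int))  # r-bit signature in band j
            tables[j][key].append(idx)
    return planes, tables

def lsh_query(y, points, planes, tables):
    """Return the candidate closest to y among points colliding in some band."""
    candidates = set()
    for j, table in enumerate(tables):
        key = tuple((planes[j] @ y > 0).astype(int))
        candidates.update(table.get(key, []))
    if not candidates:
        return None
    return min(candidates, key=lambda i: np.linalg.norm(y - points[i]))

# Toy usage: a slightly perturbed copy of point 42 should find point 42
pts = np.random.default_rng(1).normal(size=(1000, 16))
planes, tables = build_lsh_index(pts)
print(lsh_query(pts[42] + 0.01, pts, planes, tables))   # likely -> 42
```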

Currently, LSH is limited to a number of distance metrics, including Manhattan distance (\(L_{1}\)) and Euclidean distance (\(L_{2}\)). In order to use LSH for other similarity measures, one needs to transform them into the \(L_{1}\) or \(L_{2}\) metrics for which standard hashes are known. While it is possible to transform the \(\mathbf{DPPS} \) problem into an approximate nearest neighbor search problem with standard metrics, we show in this paper that such transformations result in algorithms with suboptimal complexity. Instead, we design buckets for the DPPS problem that significantly outperform standard LSH algorithms.

We address the DPPS problem by defining pairs of relations (one for each alphabet) that are sensitive to the specific joint distribution that pairs of data points belong to. Another distinctive feature of these hash relations, which we refer to as buckets, is that elements in the domain can be mapped to more than one element in the range. We refer to this framework as distribution sensitive bucketing. In distribution sensitive bucketing, the buckets \(U^{x} :\mathcal {A}^{S} \mapsto 2^{\left\{ 1,2 \ldots Z \right\} } \), \(U^{y} :\mathcal {B}^{S} \mapsto 2^{\left\{ 1,2 \ldots Z \right\} }\) satisfy the following constraints for some \(0 \le \beta \le \alpha \le 1\):

$$\begin{aligned} \bullet \ P\big (U^{x}(X) \cap U^{y}(Y) \ne \emptyset \mid (X,Y) \sim \mathbb {P} \big ) = \alpha \end{aligned}$$
(6)
$$\begin{aligned} \bullet \ P\big (U^{x}(X) \cap U^{y}(Y) \ne \emptyset \mid X \sim \mathbb {P}_{X}, \ Y \sim \mathbb {P}_{Y}, \ X \text { and } Y \text { independent} \big ) = \beta \end{aligned}$$
(7)
$$\begin{aligned} \bullet \ \mathbb {E}|U^{x}(X)| = \delta _{x} \end{aligned}$$
(8)
$$\begin{aligned} \bullet \ \mathbb {E}|U^{y}(Y)| = \delta _{y} \end{aligned}$$
(9)
$$\begin{aligned} \bullet |U^{x}(X) \cap U^{y}(Y)| \le 1 \end{aligned}$$
(10)

Here, \(Z \in \mathbb {N}\) and, for a set R, \(2^{R}\) denotes the set of all subsets of R. We refer to buckets satisfying (6), (7), (8), (9), and (10) as \((\mathcal {P},\mathcal {Q},\alpha ,\beta , \delta _{x}, \delta _{y})\) - sensitive buckets. For \((\mathcal {P},\mathcal {Q},\alpha ,\beta , \delta _{x}, \delta _{y})\) - sensitive buckets, (i) the probability that a jointly generated pair is mapped to a common bucket is \(\alpha \), (ii) the probability that an independently generated pair X, Y is mapped to a common bucket is \(\beta \), and (iii) the complexity of assigning points to the buckets is proportional to \(\delta _{x}\) and \(\delta _{y}\). Thus, intuitively, we would like to maximize \(\alpha \) while minimizing \(\beta \), \(\delta _{x}\), and \(\delta _{y}\).

The rest of the document will proceed as follows. In Sect. 3, we assume an oracle has given us a family of \((\mathcal {P},\mathcal {Q},\alpha ,\beta , \delta _{x}, \delta _{y})\) - sensitive buckets, and we design an algorithm to solve (1) based on this family. In Sect. 4, we provide a way to construct these buckets. In Sect. 5 we derive the overall complexity of the algorithm presented in Sect. 3 and in Sect. 6, we propose an algorithm for constructing optimal buckets. Finally, in Sect. 7, we detail our experiments on simulated and real mass spectra.

3 Distribution Sensitive Bucketing Algorithm

In this section we introduce an algorithm for solving (1) when an oracle has provided a family of distribution sensitive buckets; we refer to this algorithm as Distribution Sensitive Bucketing.

In contrast to the locality sensitive hashing algorithm, which attempts to find pairs of data points that are very similar to each other, the goal of distribution sensitive bucketing is to find pairs of data points that are jointly generated from a known joint probability distribution. Algorithm 2 describes a procedure to solve (1) using a family of distribution sensitive buckets. Here we use r rows and b bands, and in each band we check whether the query Y collides with a data point X in each of the r rows.

Algorithm 2
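Since the listing above may not reproduce well here, the following sketch shows one way such a banded query could be organized, assuming the oracle buckets are provided as callables indexed by band j and row i; the structure (b bands, r rows, and a positive requiring a collision in all r rows of some band) follows the description above, while the helper names and data layout are our own illustrative choices.

```python
from collections import defaultdict
from itertools import product

def preprocess(classes, U_x, b, r):
    """Hash every class X into b bands; in band j, each element of the
    Cartesian product of the r bucket sets U_x[j][i](X) is used as a key."""
    tables = [defaultdict(list) for _ in range(b)]
    for idx, X in enumerate(classes):
        for j in range(b):
            bucket_sets = [U_x[j][i](X) for i in range(r)]
            for key in product(*bucket_sets):
                tables[j][key].append(idx)
    return tables

def query(Y, classes, tables, U_y, log_lik, b, r):
    """Collect positives (classes colliding with Y in all r rows of some band)
    and return the one maximizing the log-likelihood."""
    positives = set()
    for j in range(b):
        bucket_sets = [U_y[j][i](Y) for i in range(r)]
        for key in product(*bucket_sets):
            positives.update(tables[j].get(key, []))
    if not positives:
        return None
    return max(positives, key=lambda idx: log_lik(Y, classes[idx]))

# Toy usage with b = 1 band, r = 1 row, and a trivial bucket (first symbol)
U_x = [[lambda X: {X[0]}]]
U_y = [[lambda Y: {Y[0]}]]
classes = [(0, 1, 1), (1, 0, 0)]
tables = preprocess(classes, U_x, b=1, r=1)
print(query((0, 1, 0), classes, tables, U_y,
            log_lik=lambda Y, X: -sum(y != x for y, x in zip(Y, X)), b=1, r=1))  # -> 0
```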

4 Constructing Distribution Sensitive Buckets

In the previous section, we assumed an oracle has given us a set of distribution sensitive buckets, and we designed an algorithm to solve (1) using this family of buckets. In this section, we present an approach for constructing distribution sensitive buckets.

Let A and B be arbitrary binary tensors of total dimensions \(m^{k} \times Z\) and \(n^{k} \times Z\), respectively (i.e., k index dimensions of size m or n, followed by a dimension of size Z). To construct distribution sensitive buckets, for each pair of buckets we choose positions \(1 \le S_{1} \le S_{2} \le \ldots \le S_{k} \le S\) at random, and we define buckets \(U^{x}_{S_{1},S_{2}, \ldots ,S_{k}}\) and \(U^{y}_{S_{1},S_{2}, \ldots ,S_{k}}\) as follows:

$$\begin{aligned}&U^{x}_{S_{1},S_{2}, \ldots ,S_{k}}(X) = \left\{ z\in \left\{ 1,\ldots ,Z \right\} | A_{X_{S_{1}},X_{S_{2}}, \ldots , X_{S_{k}},z} = 1 \right\} \end{aligned}$$
(11)
$$\begin{aligned}&U^{y}_{S_{1},S_{2}, \ldots ,S_{k}}(Y) = \left\{ z\in \left\{ 1,\ldots ,Z \right\} | B_{Y_{S_{1}},Y_{S_{2}}, \ldots , Y_{S_{k}},z} = 1 \right\} \end{aligned}$$
(12)

As there is a straightforward way to convert a tensor to a matrix, in the rest of this paper we treat A and B as \(m^{k}\) by Z and \(n^{k}\) by Z binary matrices. Furthermore, for any matrix M and \(u,v \in \mathbb {N}\), we use the notation M[u, v] to refer to the entry in the uth row and vth column of M. In the rest of this section, we derive \(\alpha \), \(\beta \), \(\delta _{x}\), and \(\delta _{y}\) for the proposed family of buckets.
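A direct reading of (11) and (12) in code might look as follows, where a length-k tuple of symbols is flattened into a row index of the matrix (here by treating the symbols as digits of a base-m number, one natural convention) and the bucket is the set of columns holding a 1 in that row; the flattening convention and toy matrix are illustrative.

```python
import numpy as np

def row_index(symbols, alphabet_size):
    """Flatten a length-k tuple of integer symbols into a row index in [0, alphabet_size**k)."""
    idx = 0
    for s in symbols:
        idx = idx * alphabet_size + s
    return idx

def bucket(seq, positions, matrix, alphabet_size):
    """Evaluate U_{S_1..S_k}(seq): the set of columns z with a 1 in the row
    selected by the symbols of `seq` at `positions` (cf. Eqs. 11-12)."""
    row = row_index([seq[p] for p in positions], alphabet_size)
    return set(np.flatnonzero(matrix[row]))

# Toy usage: m = 2, k = 2, Z = 3, and a 4 x 3 binary matrix A
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
X = [0, 1, 1, 0]
print(bucket(X, positions=[1, 3], matrix=A, alphabet_size=2))  # symbols (1, 0) -> row 2 -> {1, 2}
```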

Remark 1

Consider a pair of buckets \(U^{x}\) and \(U^{y}\) with corresponding matrices A and B. These buckets satisfy (10) if, for any row i of A and any row j of B, there is at most one column c with both \(A[i,c] = 1\) and \(B[j,c] = 1\). Matrices A and B satisfying this constraint are called \(\mathbf {non\text{- }intersecting}\).

Theorem 1

Given a pair of buckets \(U^{x}\) and \(U^{y}\) and a corresponding pair of matrices A and B that satisfy the non-intersecting constraint, it can be shown that \(U^{x}\) and \(U^{y}\) are \((\mathcal {P},\mathcal {Q},\alpha ,\beta ,\delta _{x},\delta _{y})\)-sensitive where

$$\begin{aligned} \alpha = \sum _{u=1}^{m^{k}}\sum _{v=1}^{n^{k}} \mathcal {P}^{k}[u,v] \, (AB^{T})[u,v] \end{aligned}$$
(13)
$$\begin{aligned} \beta = {P}_{x}^{k} A B^{T} ({P}_{y}^{k})^{T} \end{aligned}$$
(14)
$$\begin{aligned} \delta _{x} ={P}_{x}^{k}A\mathbb {I}_{z} \end{aligned}$$
(15)
$$\begin{aligned} \delta _{y} ={P}_{y}^{k}B\mathbb {I}_{z} \end{aligned}$$
(16)

and \(\mathbb {I}_{z}\) denotes a vector of ones of dimension Z.

Proof

For a proof of Theorem 1, see Supplementary Section 1.
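As a sanity check on a candidate pair of matrices A and B, the quantities \(\alpha \), \(\beta \), \(\delta _{x}\), and \(\delta _{y}\) can also be estimated directly from their definitions (6)-(9) by Monte Carlo sampling, as in the sketch below (shown for \(k=1\), with toy matrices and a toy joint distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: m = n = 2, k = 1, Z = 2; identity A and B are non-intersecting
P = np.array([[0.45, 0.05],
              [0.05, 0.45]])          # joint distribution over (a, b)
A = np.array([[1, 0], [0, 1]])        # bucket matrix for the X alphabet
B = np.array([[1, 0], [0, 1]])        # bucket matrix for the Y alphabet
Px, Py = P.sum(axis=1), P.sum(axis=0) # marginals

def sample_joint(n):
    flat = rng.choice(P.size, size=n, p=P.ravel())
    return np.unravel_index(flat, P.shape)

def collide(x, y):
    """Do U^x(x) and U^y(y) share at least one bucket (column)?"""
    return (A[x] * B[y]).sum(axis=1) > 0

n = 200_000
xj, yj = sample_joint(n)                   # jointly generated pairs
xi = rng.choice(len(Px), size=n, p=Px)     # independently generated pairs
yi = rng.choice(len(Py), size=n, p=Py)

alpha = collide(xj, yj).mean()             # Eq. (6)
beta = collide(xi, yi).mean()              # Eq. (7)
delta_x = A[xi].sum(axis=1).mean()         # Eq. (8)
delta_y = B[yi].sum(axis=1).mean()         # Eq. (9)
print(alpha, beta, delta_x, delta_y)       # roughly 0.9, 0.5, 1.0, 1.0
```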

5 Complexity Analysis

In Sect. 3, we provided Algorithm 2 to solve (1) given a family of Distribution Sensitive Buckets, and in the previous section we constructed a family of Distribution Sensitive Buckets based on \(m^{k} \times Z\) and \(n^{k} \times Z\) matrices A and B. In this section, we analyze the expected complexity of the Query portion of Algorithm 2 under the generative process defined in DPPS. We first analyze the complexity for a specific instance of Y and \(\mathcal {X} = \left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} \), and then derive the expected complexity. In Algorithm 2, the computational work for each query can be broken into (i) looking up the bucket combinations of Y in a hash table containing the bucket combinations of the points in \(\mathcal {X}\) in order to find positives, and (ii) computing \(\mathbb {P}(Y|X)\) for all the positives.

Since each hash table lookup takes O(1) time, the computational work of (i) is proportional to the number of keys queried. In band j, the keys for Y are the elements of the Cartesian product \(U_{1,j}(Y) \times U_{2,j}(Y) \times \cdots \times U_{r,j}(Y)\). Thus, the total number of lookups over the b bands can be upper bounded by

$$\begin{aligned} \sum _{j=1}^{b} \prod _{i=1}^{r} |U_{i,j}(Y)| \end{aligned}$$
(17)

The computational work of (ii) is equal to the number of positives. The number of positives can be calculated in the following way:

$$\begin{aligned} \sum _{j=1}^{b}\sum _{x \in \mathcal {X}}\prod _{i=1}^{r}|U_{i,j}(X) \cap U_{i,j}(Y)| \end{aligned}$$
(18)

The expectation of the sum of (17) and (18) given the generative process in DPPS is then given by

$$\begin{aligned} \mathbb {E}\bigg [\sum _{j=1}^{b} \prod _{i=1}^{r} |U_{i,j}(Y)| + \sum _{j=1}^{b}\sum _{x \in \mathcal {X}}\prod _{i=1}^{r}|U_{i,j}(X) \cap U_{i,j}(Y)| \ \bigg | \ X \sim \mathbb {P}_{x}, Y \sim \mathbb {P}_{y} \bigg ] = bN\beta ^{r} + b \delta _{y}^{r} \end{aligned}$$
(19)

Note that here we assumed that almost all of the pairs are independently generated. This assumption is due to the fact that only one \(X \in \mathcal {X}\) is responsible for generating Y. Now the question is how do we select b? The probability that a jointly generated pair (X, Y) is called a positive (i.e., collides in all r rows of at least one band j), which we refer to as the \(True \ Positive \ Rate\), can be calculated in the following way:

$$\begin{aligned} 1 - \prod _{j=1}^{b} \bigg ( 1 - \prod _{i=1}^{r}P\Big (U^{x}_{i,j}(X) \cap U^{y}_{i,j}(Y) \ne \emptyset \Big ) \bigg ) = 1 - (1-\alpha ^{r})^{b} \ge 1 - (e^{-\alpha ^{r}})^{b} \end{aligned}$$
(20)

We usually want to maintain a true positive rate of nearly 1, e.g. \(True \ Positive \ Rate \ge 1 - \epsilon \) where \(\epsilon \) is a small number. This can be realized by setting \((e^{-\alpha ^{r}})^{b} \le \epsilon \), i.e.

$$\begin{aligned} b \ge \frac{-\ln \epsilon }{\alpha ^{r}}, \end{aligned}$$
(21)

Therefore, assuming \(\mathcal {X}\) has already been preprocessed, the overall expected query complexity under the generative process in DPPS can be upper bounded by the following expression:

$$\begin{aligned} \frac{-\ln \epsilon }{\alpha ^{r}}\Big (N\beta ^{r} + \delta _{y}^{r}\Big ) \end{aligned}$$
(22)
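As a small numerical illustration of (21) and this bound, the snippet below computes the smallest admissible b and the resulting upper bound on the expected per-query work; the parameter values are purely illustrative.

```python
import math

def num_bands(alpha, r, eps=0.05):
    """Smallest b satisfying (21), guaranteeing a true positive rate >= 1 - eps."""
    return math.ceil(-math.log(eps) / alpha ** r)

def expected_query_cost(alpha, beta, delta_y, r, N, eps=0.05):
    """Upper bound on the expected per-query work, combining (19) and (21)."""
    b = num_bands(alpha, r, eps)
    return b * (N * beta ** r + delta_y ** r)

print(num_bands(alpha=0.9, r=8))                            # -> 7
print(expected_query_cost(0.9, 0.5, 1.5, r=8, N=1_000_000)) # roughly 2.8e4
```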

6 Designing Distribution Sensitive Buckets

In Sects. 4 and 5, we described how to construct a family of distribution sensitive buckets using matrices A and B, and we derived the complexity of Algorithm 2 based on these matrices. In this section we present an approach for finding the optimal matrices based on integer linear programming.

6.1 Integer Linear Programming Method

Algorithm 3 presents an integer linear programming approach for finding matrices A and B that optimize the complexity (22) using (13), (14), (15), and (16). Our approach is based on the assumption that matrix A is the identity matrix. The reason behind this assumption is that any matrix B is non-intersecting with the identity matrix.

Algorithm 3

We often need to design buckets for larger values of k in order to obtain more efficient algorithms for solving (1). However, the size of \(\mathcal {P}^{k}\) grows exponentially with k, and thus Algorithm 3 does not run efficiently for \(k > 10\). We therefore use Algorithm 4 to filter \(\mathcal {P}^k\) down to a smaller matrix \(\mathcal {P}^{k}_{\epsilon }\), which keeps only the rows and columns of \(\mathcal {P}^k\) whose sums are above \(\epsilon \), and pass this matrix as an input to Algorithm 3.

Algorithm 4
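A plausible reading of this filtering step, assuming \(\mathcal {P}^{k}\) is stored as an \(m^{k} \times n^{k}\) matrix, is sketched below; the exact bookkeeping in Algorithm 4 (e.g., renormalization or how the dropped rows are handled) may differ.

```python
import numpy as np

def filter_joint(P_k, eps):
    """Keep only the rows and columns of P_k whose marginal sums exceed eps."""
    rows = P_k.sum(axis=1) > eps
    cols = P_k.sum(axis=0) > eps
    return P_k[np.ix_(rows, cols)], np.flatnonzero(rows), np.flatnonzero(cols)

# Toy usage: a 4 x 4 joint table with two negligible rows and columns
P_k = np.array([[0.40, 0.10, 1e-4, 1e-4],
                [0.10, 0.39, 1e-4, 1e-4],
                [1e-4, 1e-4, 1e-4, 1e-4],
                [1e-4, 1e-4, 1e-4, 1e-4]])
P_eps, kept_rows, kept_cols = filter_joint(P_k, eps=0.01)
print(P_eps.shape)   # -> (2, 2)
```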

7 Experiments

In this section we verify the advantage of our Distribution Sensitive Bucketing approach with several experiments. In the first experiment, we compare the performance of Distribution Sensitive Bucketing to several commonly used methods from the Locality Sensitive Hashing literature on a range of theoretical distributions \(\mathcal {P}\). Although these methods are not directly applicable to our problem, the DPPS problem can be transformed into problems where these methods work. In the second experiment, we apply the Distribution Sensitive Bucketing algorithm, along with the same methods from the Locality Sensitive Hashing literature, to the problem of peptide identification from mass spectrometry signals. In this problem, given millions of peptides \(\mathcal {X} = \left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} \) and a discretized mass spectrum Y, our goal is to find a peptide \(X \in \mathcal {X}\) that maximizes \(\mathbb {P}(Y|X) = \prod _{s=1}^{S}\mathcal {P}(y_{s}|x_{s})\), where \(\mathcal {P}\) can be learned from a training data set of known peptide-spectrum pairs. We use the probabilistic model introduced in Kim et al. [10], which is trained to score a mass spectrum against a peptide sequence, accounting for neutral losses and the intensities of peaks (see Supplementary Figure 1).

7.1 Experiment 1. Theoretical Complexity

In this experiment we compared the complexity of three algorithms - LSH-Hamming, Inner Product Hash [13], and Distribution Sensitive Bucketing - on a range of probability distributions. LSH-Hamming and Inner Product Hash cannot be applied directly to the DPPS problem, as they only work when \(\{X^{1},\cdots , X^{N}\}\) and Y belong to the same alphabet. Nevertheless, through the following transformations, the DPPS problem can be reduced to a nearest neighbor search problem with Hamming distance and to the maximum inner product search problem. To transform the DPPS problem to nearest neighbor search with Hamming distance, map each element of the alphabets \(\mathcal {A} = \{a_{1}, \cdots , a_{m}\}\) and \(\mathcal {B} = \{b_{1},\cdots , b_{n}\}\) to either 0 or 1. As a result, each \( X \in \{X^{1},\cdots ,X^{N}\}\) satisfies \(X \in \{0,1\}^{S}\), and \(Y \in \{0,1\}^{S}\). To transform DPPS to the maximum inner product search problem, first change the objective function to \(\log (\mathbb {P}(Y|X)) = \sum _{s = 1}^{S}\log (\mathcal {P}(y_{s}|x_{s}))\). Observe that \(\log (\mathcal {P}(y_{s}|x_{s}))\) can be expressed as the dot product of a one-hot vector of size n (the size of the alphabet \(\mathcal {B}\)) with a vector \(\log (\mathcal {P}(\cdot |x_{s})) \in \mathbb {R}^{n}\). Now we can concatenate all the vectors (one for each \(1 \le s \le S\)) into signals \(v_{Y}\) and \(w_{X}\) of length Sn. The dot product of these two vectors will be \(\log (\mathbb {P}(Y|X))\). Thus one can apply maximum inner product search (MIPS) to the set \(\mathcal {X}^{'} = \{w_{X}\ |\ X \in \{X^{1}, \cdots , X^{N}\} \}\) and query vector \(v_{Y}\) in order to solve the DPPS problem.
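The reduction to MIPS can be checked mechanically, as in the sketch below: each position of Y contributes a one-hot block of size n and each position of X contributes the corresponding row of \(\log \mathcal {P}(\cdot |x_{s})\), so the inner product of the two length-Sn vectors recovers \(\log \mathbb {P}(Y|X)\). The helper names and toy distribution are illustrative.

```python
import numpy as np

def embed_query(Y, n):
    """v_Y: concatenation of one-hot vectors of size n, one per position."""
    v = np.zeros(len(Y) * n)
    for s, y in enumerate(Y):
        v[s * n + y] = 1.0
    return v

def embed_class(X, log_cond):
    """w_X: concatenation of the rows log P(. | x_s), one per position."""
    return np.concatenate([log_cond[x] for x in X])

# Check on a toy conditional distribution (m = n = 2, S = 3)
log_cond = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
X, Y = np.array([0, 1, 1]), np.array([0, 1, 0])
direct = log_cond[X, Y].sum()
via_mips = embed_query(Y, n=2) @ embed_class(X, log_cond)
print(np.isclose(direct, via_mips))   # -> True
```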

We benchmark the algorithms using the probability distribution \(\mathcal {P}(t) = \mathcal {P}_{1}t + \mathcal {P}_{2}(1-t)\) for different values of \(0\le t \le 1\), where \(\mathcal {P}_{1} = \begin{bmatrix} .25 &{} .25 \\ .25 &{} .25 \end{bmatrix}\) and \(\mathcal {P}_{2} = \begin{bmatrix} .95 &{} .01 \\ .03 &{} .01 \end{bmatrix}\). Since LSH-Hamming's performance depends on the particular mapping of the original alphabets to \(\{0,1\}\), we do an exhaustive search over all mappings and use the best mapping for each \(\mathcal {P}(t)\).

In Fig. 2, we plot the asymptotic complexity of each of the three algorithms. The asymptotic complexity can be expressed as \(O(N^{\lambda })\), for some \(\lambda \). The value of \(\lambda \) is plotted versus t in Fig. 2. As one can see, Distribution Sensitive Bucketing has a lower asymptotic complexity than LSH-Hamming for \(0 \le t\le .5\) and for \(t \ge .5\) Distribution Sensitive Bucketing and LSH-Hamming have the same asymptotic complexity. Inner Product Hash is always worse than Distribution Sensitive Bucketing by a large margin.

7.2 Experiment 2

In this experiment we evaluated the performance of Distribution Sensitive Bucketing on simulated spectra and peptides. We simulated the mass spectra and peptides using the probabilistic model from Kim et al. [10]. For each of the algorithms, we chose the parameters so that they theoretically achieve a 95% True Positive Rate. In Figure 2 of the Supplementary, we verify experimentally that we indeed achieve a 95% True Positive Rate. Figure 3 shows the number of positive calls for our method in comparison to the brute force method, LSH-Hamming, and Inner Product Hash. Here, we varied the number of peptides from 100 to 100,000 and computed the average number of positives per spectrum, averaged over 5,000 spectra. Figure 4 shows the runtime of brute force search, LSH-Hamming, and Inner Product Hash versus Distribution Sensitive Bucketing. Distribution Sensitive Bucketing is 20X faster than brute force, while LSH-Hamming is only 2X faster than brute force. Inner Product Hash is as slow as brute force search.

7.3 Experiment 3 - Mass Spectrometry Database Search in Proteomics

We applied Distribution Sensitive Bucketing, LSH-Hamming, Inner Product Hash, and brute force search to the problem of mass spectrometry database search in proteomics. Here we search a dataset of 93,587 spectra against the human proteome sequence. We tuned the parameters of Distribution Sensitive Bucketing, LSH-Hamming, and Inner Product Hash on a smaller test data set to reach a 95% True Positive Rate, and then applied the algorithms to the larger data set. For Distribution Sensitive Bucketing, this resulted in a 91% True Positive Rate and a 50X decrease in the number of positives in comparison to brute force search. Distribution Sensitive Bucketing also led to a 30X reduction in time compared to brute force search. LSH-Hamming resulted in a 2X reduction in positives and a 2X reduction in computation time while achieving a True Positive Rate of 93%. Inner Product Hash did not improve on brute force search.

Fig. 2.

For each algorithm we plot the theoretical asymptotic complexity of search for different values of t, corresponding to different mixtures of the distributions \(\mathcal {P}_{1}\) and \(\mathcal {P}_{2}\). By theoretical asymptotic complexity we mean the value \(\lambda \) for which the complexity of the algorithm is \(O(N^{\lambda })\).

Fig. 3.

The empirical number of positive calls made by brute force search, Distribution Sensitive Bucketing, LSH-Hamming, and Inner Product Hash. Note that brute force and Inner Product Hash perform the same. Distribution Sensitive Bucketing makes 20X fewer positive calls than brute force search and 10X fewer than LSH-Hamming.

Fig. 4.

The run time of Distribution Sensitive Bucketing (Algorithm 2) in comparison to brute force, LSH-Hamming, and Inner Product Hash. Distribution Sensitive Bucketing achieves a 20X reduction in time compared to brute force; LSH-Hamming achieves a 2X reduction.

8 Conclusion

In this paper we introduce a problem from computational biology that requires computing the joint likelihood of all pairs of data points coming from two separate large data sets. In order to speed up the brute force procedure, we develop a novel bucketing method. We show theoretically and experimentally that our method is superior to methods from the locality sensitive hashing literature.