
1 Introduction

One of the fundamental challenges in mass-spectrometry-based proteomics is to identify the proteins present in a cell culture by searching their mass spectrometry fingerprints against all the peptides in a reference proteome database. As the number of mass spectra and the size of the reference proteome increase, this search becomes very slow, especially when post-translational modifications are allowed.

Given a peptide sequence, existing methods construct a binary-valued spectrum from the peptide, with ones at the positions where peaks are present and zeros otherwise. A probabilistic model is then trained to learn the joint probability distribution P(spec, pep) between the predicted spectra and the discretized mass spectra (Fig. 1) [10]. Given a spectrum spec and a set of peptides \(Pep = \left\{ pep_{1}, pep_{2}, \ldots , pep_{N} \right\} \), the goal is to find the peptide(s) \(pep \in Pep\) that maximize P(spec|pep). This task motivates the following problem.
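As an illustration of the discretization step, the following minimal sketch bins a list of peak masses into such a binary vector; the bin width and maximum mass used here are arbitrary choices made for the example, not values prescribed by the methods above.

```python
import numpy as np

def binarize_spectrum(peak_masses, max_mass=2000.0, bin_width=1.0):
    """Discretize a list of peak masses (in Da) into a binary vector.

    A position is 1 if any peak falls into the corresponding mass bin,
    0 otherwise. max_mass and bin_width are illustrative choices.
    """
    n_bins = int(np.ceil(max_mass / bin_width))
    spectrum = np.zeros(n_bins, dtype=np.int8)
    for mass in peak_masses:
        idx = int(mass // bin_width)
        if 0 <= idx < n_bins:
            spectrum[idx] = 1
    return spectrum

# Example with a few hypothetical peak masses
print(binarize_spectrum([147.1, 278.2, 407.3]).sum())  # -> 3
```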

Fig. 1.

Predicted spectra of three peptides FAG, KLT, and AMR are shown, with their mass spectra shown at the bottom. The joint probability distribution between the predicted and mass spectra of the peptides can be learned.

Database Search Problem In Probabilistic Settings (DPPS). Consider the following maximum likelihood problem, which generalizes the problem of matching peptides to spectra. Let \(\mathcal {A} = \left\{ a_{1}, a_{2}, \ldots , a_{m} \right\} \) and \(\mathcal {B}= \left\{ b_{1}, b_{2}, \ldots , b_{n} \right\} \) be discrete alphabets where \(m, n \in \mathbb {N}\). Let \(\mathcal {P}\) be a joint distribution on the alphabets \(\mathcal {A}\) and \(\mathcal {B}\) such that \(\sum _{i=1}^{m}\sum _{j=1}^{n}\mathcal {P}(a_{i},b_{j}) = 1\). Let \(S \in \mathbb {N}\) and \(\mathbb {P}(X,Y) = \prod _{s=1}^{S}\mathcal {P}(x_{s},y_{s}) \) where \(X = (x_{1},x_{2}, \ldots , x_{S})\), \(Y = (y_{1},y_{2}, \ldots , y_{S})\) and \(x_{s} \in \mathcal {A}\), \(y_{s} \in \mathcal {B}\) for \(1 \le s \le S\). Given a data point \(Y \in \mathcal {B}^{S}\) and a set of classes \(\left\{ X^{1}, \cdots , X^{N} \right\} \subseteq \mathcal {A}^{S}\), our goal is to accurately and efficiently predict the class \(X^{t}\), \(1 \le t \le N\), that generated Y.

Note that, in reference to mass-spectrometry-based proteomics, the classes \(\{X^{1},\cdots ,X^{N}\}\) model the set of peptides \(Pep = \left\{ pep_{1}, pep_{2}, \ldots , pep_{N} \right\} \) and Y models a spectrum spec. We address the DPPS problem by solving the following optimization problem:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{{X \in \left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} }} \mathbb {P}(Y|X) \end{aligned}$$
(1)

A naive way to solve this optimization problem is to compute \(\mathbb {P}(Y|X^{t})\) for each \(1\le t \le N\), and find the maximum among them. The complexity of this approach grows linearly with N, and thus is prohibitively slow for practical applications as N grows large.
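For concreteness, a minimal sketch of this naive approach is shown below, with the alphabets represented as integer codes and the conditional probabilities stored in a small table; all specific values are toy placeholders.

```python
import numpy as np

def brute_force_dpps(Y, classes, cond):
    """Naive solution to (1): score every class X against the query Y.

    Y       : length-S array of symbols from alphabet B (integer-coded)
    classes : list of length-S arrays of symbols from alphabet A (integer-coded)
    cond    : m x n array with cond[a, b] = P(b | a)

    Runs in O(N * S) time, i.e. linear in the number of classes N.
    """
    log_cond = np.log(cond)
    scores = [log_cond[X, Y].sum() for X in classes]  # sum_s log P(y_s | x_s)
    return int(np.argmax(scores))

# Toy usage with m = n = 2 and S = 4
cond = np.array([[0.9, 0.1], [0.2, 0.8]])
classes = [np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])]
Y = np.array([0, 0, 1, 1])
print(brute_force_dpps(Y, classes, cond))  # -> 0
```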

2 Related Work

As stated above, the issue with the naive way of solving DPPS is that the brute force calculation of \(\mathbb {P}(Y|X)\) for every \(X \in \{X^{1},\cdots ,X^{N}\}\) is slow. Another domain where a naive brute force calculation is prohibitively slow is nearest neighbor search. In nearest neighbor search, there is a data point \(Y \in \mathbb {R}^{S}\) and a set of points \(\left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} \subseteq \mathbb {R}^{S}\). The goal is to quickly solve the following minimization problem:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{{X \in \left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} }} \Vert Y-X\Vert _{2} \end{aligned}$$

where \(\Vert Y-X\Vert _{2}\) stands for the Euclidean norm. This problem is equivalent to (1) in the special case where the probability distribution \(\mathbb {P}\) is continuous and \(P(y_{s}|x_{s}) \sim \mathcal {N}(x_{s}, \sigma )\) , \(1 \le s \le S\).
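To see this, take logarithms; the normalization constants of the Gaussians do not depend on X and therefore do not affect the maximizer:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{X} \prod _{s=1}^{S} \mathcal {N}(y_{s}; x_{s}, \sigma ) = \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{X} \sum _{s=1}^{S} -\frac{(y_{s}-x_{s})^{2}}{2\sigma ^{2}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{X} \Vert Y-X\Vert _{2} \end{aligned}$$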

For high-dimensional data, the complexity of exact nearest neighbor search grows linearly with the number of data points [9]. Therefore, researchers consider the approximate nearest neighbor (ANN) search problem. In the ANN-search problem, the objective is to find \(X \in \left\{ X^{1}, X^{2}, \ldots X^{N} \right\} \) such that

$$\begin{aligned} \Vert Y-X\Vert _{2} \le c \Vert Y-X^{*}\Vert _{2} \end{aligned}$$
(2)

where

$$\begin{aligned} X^{*} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{{X \in \left\{ X^{1}, X^{2}, \ldots X^{N} \right\} }} \Vert Y-X\Vert _{2} \end{aligned}$$
(3)

and \(c > 1\) is referred to as the “approximation factor”. A common algorithm for solving this problem is locality sensitive hashing [2, 6, 7, 9] (Algorithm 1). This algorithm takes as input hashes h that satisfy the following constraints for some \(R > 0\), and \(0 < P_{2} \le P_{1} \le 1\):

$$\begin{aligned}&\bullet \ \text {If } \Vert Y-X\Vert _{2}\le R, \text { then } h(X) = h(Y) \text { with probability at least } P_{1}\end{aligned}$$
(4)
$$\begin{aligned}&\bullet \ \text {If } \Vert Y-X\Vert _{2} \ge cR, \text { then } h(X) = h(Y) \text { with probability at most } P_{2} \end{aligned}$$
(5)

where hashes h satisfying (4) and (5) are called \((R,cR,P_{1},P_{2})\) - sensitive hashes. As stated in Gionis et al. [7], “The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point.”

Algorithm 1

The locality sensitive hashing algorithm takes a query point Y and aims to find the point most similar to it in a database. The algorithm does this by first applying r hash functions to the query point in each band j, \(1 \le j \le b\). Then, in each band, the algorithm considers all points X in the database that have been hashed to the same values as Y in all of the r hash functions.
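A minimal sketch of this banded scheme is shown below, using random-hyperplane (sign) hashes as a stand-in for whatever \((R,cR,P_{1},P_{2})\)-sensitive family is appropriate; the hash family, parameters, and helper names are illustrative rather than those analyzed in [7].

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(points, b=10, r=4, seed=0):
    """Index `points` (N x S array) with b bands of r sign hashes each."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(b, r, points.shape[1]))   # random hyperplanes
    tables = [defaultdict(list) for _ in range(b)]
    for idx, x in enumerate(points):
        for j in range(b):
            key = tuple((planes[j] @ x > 0).astype(int))  # r-bit signature in band j
            tables[j][key].append(idx)
    return planes, tables

def lsh_query(y, points, planes, tables):
    """Return the candidate closest to y among points colliding in some band."""
    candidates = set()
    for j, table in enumerate(tables):
        key = tuple((planes[j] @ y > 0).astype(int))
        candidates.update(table.get(key, []))
    if not candidates:
        return None
    return min(candidates, key=lambda i: np.linalg.norm(y - points[i]))

# Toy usage: a slightly perturbed copy of point 42 should find point 42
pts = np.random.default_rng(1).normal(size=(1000, 16))
planes, tables = build_lsh_index(pts)
print(lsh_query(pts[42] + 0.01, pts, planes, tables))   # likely -> 42
```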

Currently, LSH is limited to a number of distance metrics, including Manhattan distance (\(L_{1}\)) and Euclidean distance (\(L_{2}\)). In order to use LSH for other similarity measures, one needs to transform them into the \(L_{1}\) or \(L_{2}\) metrics for which standard hashes are known. While it is possible to transform the \(\mathbf{DPPS} \) problem into an approximate nearest neighbor search problem with standard metrics, we show in this paper that such transformations result in algorithms with suboptimal complexity. Instead, we design buckets for the DPPS problem that significantly outperform standard LSH algorithms.

We address the DPPS problem by defining pairs of relations (one for each alphabet) that are sensitive to the specific joint distribution that pairs of data points belong to. Another distinctive feature of these hash relations, which we refer to as buckets, is that elements in the domain can be mapped to more than one element in the range. We refer to this framework as distribution sensitive bucketing. In distribution sensitive bucketing, the buckets \(U^{x} :\mathcal {A}^{S} \mapsto 2^{\left\{ 1,2 \ldots Z \right\} } \), \(U^{y} :\mathcal {B}^{S} \mapsto 2^{\left\{ 1,2 \ldots Z \right\} }\) satisfy the following constraints for some \(0 \le \beta \le \alpha \le 1\):

$$\begin{aligned} \bullet \ P\big (U^{x}(X) \cap U^{y}(Y) \ne \emptyset \mid (X,Y) \sim \mathbb {P} \big ) = \alpha \end{aligned}$$
(6)
$$\begin{aligned} \bullet \ P\big (U^{x}(X) \cap U^{y}(Y) \ne \emptyset \mid X \sim \mathbb {P}_{X}, \ Y \sim \mathbb {P}_{Y}, \ X \text { and } Y \text { independent} \big ) = \beta \end{aligned}$$
(7)
$$\begin{aligned} \bullet \ \mathbb {E}|U^{x}(X)| = \delta _{x} \end{aligned}$$
(8)
$$\begin{aligned} \bullet \ \mathbb {E}|U^{y}(Y)| = \delta _{y} \end{aligned}$$
(9)
$$\begin{aligned} \bullet |U^{x}(X) \cap U^{y}(Y)| \le 1 \end{aligned}$$
(10)

Here, \(Z \in \mathbb {N}\) and, for a set R, \(2^{R}\) denotes the set of all subsets of R. We refer to buckets satisfying (6), (7), (8), (9), and (10) as \((\mathcal {P},\mathcal {Q},\alpha ,\beta , \delta _{x}, \delta _{y})\) - sensitive buckets. For \((\mathcal {P},\mathcal {Q},\alpha ,\beta , \delta _{x}, \delta _{y})\) - sensitive buckets, (i) the probability that a jointly generated pair is mapped to a common bucket is \(\alpha \), (ii) the probability that an independently generated pair X, Y is mapped to a common bucket is \(\beta \), and (iii) the complexity of assigning points to the buckets is proportional to \(\delta _{x}\) and \(\delta _{y}\). Thus, intuitively, we would like to maximize \(\alpha \) while minimizing \(\beta \), \(\delta _{x}\), and \(\delta _{y}\).

The rest of the document will proceed as follows. In Sect. 3, we assume an oracle has given us a family of \((\mathcal {P},\mathcal {Q},\alpha ,\beta , \delta _{x}, \delta _{y})\) - sensitive buckets, and we design an algorithm to solve (1) based on this family. In Sect. 4, we provide a way to construct these buckets. In Sect. 5 we derive the overall complexity of the algorithm presented in Sect. 3 and in Sect. 6, we propose an algorithm for constructing optimal buckets. Finally, in Sect. 7, we detail our experiments on simulated and real mass spectra.

3 Distribution Sensitive Bucketing Algorithm

In this section we introduce an algorithm for solving (1) when an oracle has provided a family of distribution sensitive buckets; we refer to this algorithm as Distribution Sensitive Bucketing.

In contrast to the locality sensitive hashing algorithm, which attempts to find pairs of data points that are very similar to each other, the goal of distribution sensitive bucketing is to find pairs of data points that are jointly generated from a known joint probability distribution. Algorithm 2 describes a procedure to solve (1) using a family of distribution sensitive buckets. Here we use r rows and b bands, and in each band we check whether the query Y collides with a data point X in each of the r rows.

Algorithm 2
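Since the listing above may not reproduce well here, the following sketch shows one way such a banded query could be organized, assuming the oracle buckets are provided as callables indexed by band j and row i; the structure (b bands, r rows, and a positive requiring a collision in all r rows of some band) follows the description above, while the helper names and data layout are our own illustrative choices.

```python
from collections import defaultdict
from itertools import product

def preprocess(classes, U_x, b, r):
    """Hash every class X into b bands; in band j, each element of the
    Cartesian product of the r bucket sets U_x[j][i](X) is used as a key."""
    tables = [defaultdict(list) for _ in range(b)]
    for idx, X in enumerate(classes):
        for j in range(b):
            bucket_sets = [U_x[j][i](X) for i in range(r)]
            for key in product(*bucket_sets):
                tables[j][key].append(idx)
    return tables

def query(Y, classes, tables, U_y, log_lik, b, r):
    """Collect positives (classes colliding with Y in all r rows of some band)
    and return the one maximizing the log-likelihood."""
    positives = set()
    for j in range(b):
        bucket_sets = [U_y[j][i](Y) for i in range(r)]
        for key in product(*bucket_sets):
            positives.update(tables[j].get(key, []))
    if not positives:
        return None
    return max(positives, key=lambda idx: log_lik(Y, classes[idx]))

# Toy usage with b = 1 band, r = 1 row, and a trivial bucket (first symbol)
U_x = [[lambda X: {X[0]}]]
U_y = [[lambda Y: {Y[0]}]]
classes = [(0, 1, 1), (1, 0, 0)]
tables = preprocess(classes, U_x, b=1, r=1)
print(query((0, 1, 0), classes, tables, U_y,
            log_lik=lambda Y, X: -sum(y != x for y, x in zip(Y, X)), b=1, r=1))  # -> 0
```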

4 Constructing Distribution Sensitive Buckets

In the previous section, we assumed an oracle has given us a set of distribution sensitive buckets, and we designed an algorithm to solve (1) using this family of buckets. In this section, we present an approach for constructing distribution sensitive buckets.

Let A and B be arbitrary binary tensors of total dimensions \(m^{k} \times Z\) and \(n^{k} \times Z\), respectively (i.e., k index dimensions of size m or n, followed by a dimension of size Z). To construct distribution sensitive buckets, for each pair of buckets we choose positions \(1 \le S_{1} \le S_{2} \le \ldots \le S_{k} \le S\) at random, and we define buckets \(U^{x}_{S_{1},S_{2}, \ldots ,S_{k}}\) and \(U^{y}_{S_{1},S_{2}, \ldots ,S_{k}}\) as follows:

$$\begin{aligned}&U^{x}_{S_{1},S_{2}, \ldots ,S_{k}}(X) = \left\{ z\in \left\{ 1,\ldots ,Z \right\} | A_{X_{S_{1}},X_{S_{2}}, \ldots , X_{S_{k}},z} = 1 \right\} \end{aligned}$$
(11)
$$\begin{aligned}&U^{y}_{S_{1},S_{2}, \ldots ,S_{k}}(Y) = \left\{ z\in \left\{ 1,\ldots ,Z \right\} | B_{Y_{S_{1}},Y_{S_{2}}, \ldots , Y_{S_{k}},z} = 1 \right\} \end{aligned}$$
(12)

As there is a straightforward way to convert a tensor to a matrix, in the rest of this paper we treat A and B as \(m^{k}\) by Z and \(n^{k}\) by Z binary matrices. Furthermore, for any matrix M and \(u,v \in \mathbb {N}\), we use the notation M[u, v] to refer to the entry in the uth row and vth column of M. In the rest of this section, we derive \(\alpha \), \(\beta \), \(\delta _{x}\), and \(\delta _{y}\) for the proposed family of buckets.
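A direct reading of (11) and (12) in code might look as follows, where a length-k tuple of symbols is flattened into a row index of the matrix (here by treating the symbols as digits of a base-m number, one natural convention) and the bucket is the set of columns holding a 1 in that row; the flattening convention and toy matrix are illustrative.

```python
import numpy as np

def row_index(symbols, alphabet_size):
    """Flatten a length-k tuple of integer symbols into a row index in [0, alphabet_size**k)."""
    idx = 0
    for s in symbols:
        idx = idx * alphabet_size + s
    return idx

def bucket(seq, positions, matrix, alphabet_size):
    """Evaluate U_{S_1..S_k}(seq): the set of columns z with a 1 in the row
    selected by the symbols of `seq` at `positions` (cf. Eqs. 11-12)."""
    row = row_index([seq[p] for p in positions], alphabet_size)
    return set(np.flatnonzero(matrix[row]))

# Toy usage: m = 2, k = 2, Z = 3, and a 4 x 3 binary matrix A
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
X = [0, 1, 1, 0]
print(bucket(X, positions=[1, 3], matrix=A, alphabet_size=2))  # symbols (1, 0) -> row 2 -> {1, 2}
```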

Remark 1

Consider a pair of buckets \(U^{x}\) and \(U^{y}\) with corresponding matrices A and B. These buckets satisfy (10) if, for any row i of A and any row j of B, there is at most one column c with both \(A[i,c] = 1\) and \(B[j,c] = 1\). Matrices A and B satisfying this constraint are called \(\mathbf {non\text{- }intersecting}\).

Theorem 1

Given a pair of buckets \(U^{x}\) and \(U^{y}\) and a corresponding pair of matrices A and B that satisfy the non-intersecting constraint, it can be shown that \(U^{x}\) and \(U^{y}\) are \((\mathcal {P},\mathcal {Q},\alpha ,\beta ,\delta _{x},\delta _{y})\)-sensitive where

$$\begin{aligned} \alpha = \sum _{u=1}^{m^{k}}\sum _{v=1}^{n^{k}} \mathcal {P}^{k}[u,v] \, (AB^{T})[u,v] \end{aligned}$$
(13)
$$\begin{aligned} \beta = {P}_{x}^{k} A B^{T} ({P}_{y}^{k})^{T} \end{aligned}$$
(14)
$$\begin{aligned} \delta _{x} ={P}_{x}^{k}A\mathbb {I}_{z} \end{aligned}$$
(15)
$$\begin{aligned} \delta _{y} ={P}_{y}^{k}B\mathbb {I}_{z} \end{aligned}$$
(16)

and \(\mathbb {I}_{z}\) denotes a vector of ones of dimension Z.

Proof

For a proof of Theorem 1, see Supplementary Section 1.
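As a sanity check on a candidate pair of matrices A and B, the quantities \(\alpha \), \(\beta \), \(\delta _{x}\), and \(\delta _{y}\) can also be estimated directly from their definitions (6)-(9) by Monte Carlo sampling, as in the sketch below (shown for \(k=1\), with toy matrices and a toy joint distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: m = n = 2, k = 1, Z = 2; identity A and B are non-intersecting
P = np.array([[0.45, 0.05],
              [0.05, 0.45]])          # joint distribution over (a, b)
A = np.array([[1, 0], [0, 1]])        # bucket matrix for the X alphabet
B = np.array([[1, 0], [0, 1]])        # bucket matrix for the Y alphabet
Px, Py = P.sum(axis=1), P.sum(axis=0) # marginals

def sample_joint(n):
    flat = rng.choice(P.size, size=n, p=P.ravel())
    return np.unravel_index(flat, P.shape)

def collide(x, y):
    """Do U^x(x) and U^y(y) share at least one bucket (column)?"""
    return (A[x] * B[y]).sum(axis=1) > 0

n = 200_000
xj, yj = sample_joint(n)                   # jointly generated pairs
xi = rng.choice(len(Px), size=n, p=Px)     # independently generated pairs
yi = rng.choice(len(Py), size=n, p=Py)

alpha = collide(xj, yj).mean()             # Eq. (6)
beta = collide(xi, yi).mean()              # Eq. (7)
delta_x = A[xi].sum(axis=1).mean()         # Eq. (8)
delta_y = B[yi].sum(axis=1).mean()         # Eq. (9)
print(alpha, beta, delta_x, delta_y)       # roughly 0.9, 0.5, 1.0, 1.0
```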

5 Complexity Analysis

In Sect. 3, we provided Algorithm 2 to solve (1) given a family of Distribution Sensitive Buckets, and in the previous section we constructed a family of Distribution Sensitive Buckets based on \(m^{k} \times Z\) and \(n^{k} \times Z\) matrices A and B. In this section, we analyze the expected complexity of the Query portion of Algorithm 2 under the generative process defined in DPPS. We first analyze the complexity for a specific instance of Y and \(\mathcal {X} = \left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} \), and then derive the expected complexity. In Algorithm 2, the computational work for each query can be broken into (i) looking up the bucket combinations of Y in a hash table containing the bucket combinations of the points in \(\mathcal {X}\) in order to find positives, and (ii) computing \(\mathbb {P}(Y|X)\) for all the positives.

Since each hash table lookup takes O(1) time, the computational work of (i) is proportional to the number of keys queried. In band j, the keys for Y are the elements of the Cartesian product \(U_{1,j}(Y) \times U_{2,j}(Y) \times \cdots \times U_{r,j}(Y)\). Thus, the total number of lookups over the b bands can be upper bounded by

$$\begin{aligned} \sum _{j=1}^{b} \prod _{i=1}^{r} |U_{i,j}(Y)| \end{aligned}$$
(17)

The computational work of (ii) is equal to the number of positives. The number of positives can be calculated in the following way:

$$\begin{aligned} \sum _{j=1}^{b}\sum _{x \in \mathcal {X}}\prod _{i=1}^{r}|U_{i,j}(X) \cap U_{i,j}(Y)| \end{aligned}$$
(18)

The expectation of the sum of (17) and (18) given the generative process in DPPS is then given by

$$\begin{aligned} \mathbb {E}\bigg [\sum _{j=1}^{b} \prod _{i=1}^{r} |U_{i,j}(Y)| + \sum _{j=1}^{b}\sum _{x \in \mathcal {X}}\prod _{i=1}^{r}|U_{i,j}(X) \cap U_{i,j}(Y)| \ \bigg | \ X \sim \mathbb {P}_{x}, Y \sim \mathbb {P}_{y} \bigg ] = bN\beta ^{r} + b \delta _{y}^{r} \end{aligned}$$
(19)

Note that here we assumed that almost all of the pairs are independently generated. This assumption is due to the fact that only one \(X \in \mathcal {X}\) is responsible for generating Y. Now the question is how do we select b? The probability that a jointly generated pair (X, Y) is called a positive (i.e., collides in all r rows of at least one band j), which we refer to as the \(True \ Positive \ Rate\), can be calculated in the following way:

$$\begin{aligned} 1 - \prod _{j=1}^{b} \bigg ( 1 - \prod _{i=1}^{r}P\Big (U^{x}_{i,j}(X) \cap U^{y}_{i,j}(Y) \ne \emptyset \Big ) \bigg ) = 1 - (1-\alpha ^{r})^{b} \ge 1 - (e^{-\alpha ^{r}})^{b} \end{aligned}$$
(20)

We usually want to maintain a true positive rate of nearly 1, e.g. \(True \ Positive \ Rate \ge 1 - \epsilon \) where \(\epsilon \) is a small number. This can be realized by setting \((e^{-\alpha ^{r}})^{b} \le \epsilon \), i.e.

$$\begin{aligned} b \ge \frac{-\ln \epsilon }{\alpha ^{r}}, \end{aligned}$$
(21)

Therefore, assuming \(\mathcal {X}\) has already been preprocessed, the overall expected query complexity under the generative process in DPPS can be upper bounded by the following expression:

$$\begin{aligned} \frac{-\ln \epsilon }{\alpha ^{r}}\Big (N\beta ^{r} + \delta _{y}^{r}\Big ) \end{aligned}$$
(22)
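As a small numerical illustration of (21) and this bound, the snippet below computes the smallest admissible b and the resulting upper bound on the expected per-query work; the parameter values are purely illustrative.

```python
import math

def num_bands(alpha, r, eps=0.05):
    """Smallest b satisfying (21), guaranteeing a true positive rate >= 1 - eps."""
    return math.ceil(-math.log(eps) / alpha ** r)

def expected_query_cost(alpha, beta, delta_y, r, N, eps=0.05):
    """Upper bound on the expected per-query work, combining (19) and (21)."""
    b = num_bands(alpha, r, eps)
    return b * (N * beta ** r + delta_y ** r)

print(num_bands(alpha=0.9, r=8))                            # -> 7
print(expected_query_cost(0.9, 0.5, 1.5, r=8, N=1_000_000)) # roughly 2.8e4
```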

6 Designing Distribution Sensitive Buckets

In Sects. 4 and 5, we described how to construct a family of distribution sensitive buckets using matrices A and B, and we derived the complexity of Algorithm 2 based on these matrices. In this section we present an approach for finding the optimal matrices based on integer linear programming.

6.1 Integer Linear Programming Method

Algorithm 3 presents an integer linear programming approach for finding matrices A and B that optimize the complexity (22) using (13), (14), (15), and (16). Our approach is based on the assumption that matrix A is the identity matrix. The reason behind this assumption is that any matrix B is non-intersecting with the identity matrix.

Algorithm 3

We often need to design buckets for larger values of k in order to obtain more efficient algorithms for solving (1). However, the size of \(\mathcal {P}^{k}\) grows exponentially with k, and thus Algorithm 3 does not run efficiently for \(k > 10\). We therefore use Algorithm 4 to filter \(\mathcal {P}^k\) down to a smaller matrix \(\mathcal {P}^{k}_{\epsilon }\), which keeps only the rows and columns of \(\mathcal {P}^k\) whose sums are above \(\epsilon \), and pass this matrix as an input to Algorithm 3.

Algorithm 4
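A plausible reading of this filtering step, assuming \(\mathcal {P}^{k}\) is stored as an \(m^{k} \times n^{k}\) matrix, is sketched below; the exact bookkeeping in Algorithm 4 (e.g., renormalization or how the dropped rows are handled) may differ.

```python
import numpy as np

def filter_joint(P_k, eps):
    """Keep only the rows and columns of P_k whose marginal sums exceed eps."""
    rows = P_k.sum(axis=1) > eps
    cols = P_k.sum(axis=0) > eps
    return P_k[np.ix_(rows, cols)], np.flatnonzero(rows), np.flatnonzero(cols)

# Toy usage: a 4 x 4 joint table with two negligible rows and columns
P_k = np.array([[0.40, 0.10, 1e-4, 1e-4],
                [0.10, 0.39, 1e-4, 1e-4],
                [1e-4, 1e-4, 1e-4, 1e-4],
                [1e-4, 1e-4, 1e-4, 1e-4]])
P_eps, kept_rows, kept_cols = filter_joint(P_k, eps=0.01)
print(P_eps.shape)   # -> (2, 2)
```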

7 Experiments

In this section we verify the advantage of our Distribution Sensitive Bucketing approach with several experiments. In the first experiment, we compare the performance of Distribution Sensitive Bucketing to several commonly used methods from the Locality Sensitive Hashing literature on a range of theoretical distributions \(\mathcal {P}\). Although these methods are not directly applicable to our problem, the DPPS problem can be transformed into problems where these methods work. In the second experiment, we apply the Distribution Sensitive Bucketing algorithm, along with the same methods from the Locality Sensitive Hashing literature, to the problem of peptide identification from mass spectrometry signals. In this problem, given millions of peptides \(\mathcal {X} = \left\{ X^{1}, X^{2}, X^{3} \ldots X^{N} \right\} \) and a discretized mass spectrum Y, our goal is to find a peptide \(X \in \mathcal {X}\) that maximizes \(\mathbb {P}(Y|X) = \prod _{s=1}^{S}\mathcal {P}(y_{s}|x_{s})\), where \(\mathcal {P}\) can be learned from a training data set of known peptide-spectrum pairs. We use the probabilistic model introduced in Kim et al. [10], which is trained to score a mass spectrum against a peptide sequence, accounting for neutral losses and the intensities of peaks (see Supplementary Figure 1).

7.1 Experiment 1. Theoretical Complexity

In this experiment we compared the complexity of three algorithms - LSH-Hamming, Inner Product Hash [13], and Distribution Sensitive Bucketing - on a range of probability distributions. LSH-Hamming and Inner Product Hash cannot be applied directly to the DPPS problem, as they only work when \(\{X^{1},\cdots , X^{N}\}\) and Y belong to the same alphabet. Nevertheless, through the following transformations, the DPPS problem can be reduced to a nearest neighbor search problem with Hamming distance and to the maximum inner product search problem. To transform the DPPS problem to nearest neighbor search with Hamming distance, map each element of the alphabets \(\mathcal {A} = \{a_{1}, \cdots , a_{m}\}\) and \(\mathcal {B} = \{b_{1},\cdots , b_{n}\}\) to either 0 or 1. As a result, each \( X \in \{X^{1},\cdots ,X^{N}\}\) satisfies \(X \in \{0,1\}^{S}\), and \(Y \in \{0,1\}^{S}\). To transform DPPS to the maximum inner product search problem, first change the objective function to \(\log (\mathbb {P}(Y|X)) = \sum _{s = 1}^{S}\log (\mathcal {P}(y_{s}|x_{s}))\). Observe that \(\log (\mathcal {P}(y_{s}|x_{s}))\) can be expressed as the dot product of a one-hot vector of size n (the size of the alphabet \(\mathcal {B}\)) with a vector \(\log (\mathcal {P}(\cdot |x_{s})) \in \mathbb {R}^{n}\). Now we can concatenate all the vectors (one for each \(1 \le s \le S\)) into signals \(v_{Y}\) and \(w_{X}\) of length Sn. The dot product of these two vectors will be \(\log (\mathbb {P}(Y|X))\). Thus one can apply maximum inner product search (MIPS) to the set \(\mathcal {X}^{'} = \{w_{X}\ |\ X \in \{X^{1}, \cdots , X^{N}\} \}\) and query vector \(v_{Y}\) in order to solve the DPPS problem.
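The reduction to MIPS can be checked mechanically, as in the sketch below: each position of Y contributes a one-hot block of size n and each position of X contributes the corresponding row of \(\log \mathcal {P}(\cdot |x_{s})\), so the inner product of the two length-Sn vectors recovers \(\log \mathbb {P}(Y|X)\). The helper names and toy distribution are illustrative.

```python
import numpy as np

def embed_query(Y, n):
    """v_Y: concatenation of one-hot vectors of size n, one per position."""
    v = np.zeros(len(Y) * n)
    for s, y in enumerate(Y):
        v[s * n + y] = 1.0
    return v

def embed_class(X, log_cond):
    """w_X: concatenation of the rows log P(. | x_s), one per position."""
    return np.concatenate([log_cond[x] for x in X])

# Check on a toy conditional distribution (m = n = 2, S = 3)
log_cond = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
X, Y = np.array([0, 1, 1]), np.array([0, 1, 0])
direct = log_cond[X, Y].sum()
via_mips = embed_query(Y, n=2) @ embed_class(X, log_cond)
print(np.isclose(direct, via_mips))   # -> True
```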

We benchmark the algorithms using the probability distribution \(\mathcal {P}(t) = \mathcal {P}_{1}t + \mathcal {P}_{2}(1-t)\) for different values of \(0\le t \le 1\), where \(\mathcal {P}_{1} = \begin{bmatrix} .25 &{} .25 \\ .25 &{} .25 \end{bmatrix}\) and \(\mathcal {P}_{2} = \begin{bmatrix} .95 &{} .01 \\ .03 &{} .01 \end{bmatrix}\). Since LSH-Hamming's performance depends on the particular mapping of the original alphabets to \(\{0,1\}\), we do an exhaustive search over all mappings and use the best mapping for each \(\mathcal {P}(t)\).

In Fig. 2, we plot the asymptotic complexity of each of the three algorithms. The asymptotic complexity can be expressed as \(O(N^{\lambda })\), for some \(\lambda \). The value of \(\lambda \) is plotted versus t in Fig. 2. As one can see, Distribution Sensitive Bucketing has a lower asymptotic complexity than LSH-Hamming for \(0 \le t\le .5\) and for \(t \ge .5\) Distribution Sensitive Bucketing and LSH-Hamming have the same asymptotic complexity. Inner Product Hash is always worse than Distribution Sensitive Bucketing by a large margin.

7.2 Experiment 2

In this experiment we evaluated the performance of Distribution Sensitive Bucketing on simulated spectra and peptides. We simulated the mass spectra and peptides using the probabilistic model from Kim et al. [10]. For each of the algorithms, we chose the parameters so that they theoretically achieve a 95% True Positive Rate. In Figure 2 of the Supplementary, we verify experimentally that we indeed achieve a 95% True Positive Rate. Figure 3 shows the number of positive calls for our method in comparison to the brute force method, LSH-Hamming, and Inner Product Hash. Here, we varied the number of peptides from 100 to 100,000 and computed the average number of positives per spectrum, averaged over 5,000 spectra. Figure 4 shows the runtime of brute force search, LSH-Hamming, and Inner Product Hash versus Distribution Sensitive Bucketing. Distribution Sensitive Bucketing is 20X faster than brute force, while LSH-Hamming is only 2X faster than brute force. Inner Product Hash is as slow as brute force search.

7.3 Experiment 3 - Mass Spectrometry Database Search in Proteomics

We applied Distribution Sensitive Bucketing, LSH-Hamming, Inner Product Hash, and brute force search to the problem of mass spectrometry database search in proteomics. Here we search a dataset of 93,587 spectra against the human proteome sequence. We tuned the parameters of Distribution Sensitive Bucketing, LSH-Hamming, and Inner Product Hash on a smaller test data set to reach a 95% True Positive Rate, and then applied the algorithms to the larger data set. For Distribution Sensitive Bucketing, this resulted in a 91% True Positive Rate and a 50X decrease in the number of positives in comparison to brute force search. Distribution Sensitive Bucketing also led to a 30X reduction in time compared to brute force search. LSH-Hamming resulted in a 2X reduction in positives and a 2X reduction in computation time while achieving a True Positive Rate of 93%. Inner Product Hash did not improve on brute force search.

Fig. 2.

For each algorithm we plot the theoretical asymptotic complexity of search for different values of t, corresponding to different mixtures of the distributions \(\mathcal {P}_{1}\) and \(\mathcal {P}_{2}\). By theoretical asymptotic complexity we mean the value \(\lambda \) for which the complexity of the algorithm is \(O(N^{\lambda })\).

Fig. 3.

The empirical number of positive calls made by brute force search, Distribution Sensitive Bucketing, LSH-Hamming, and Inner Product Hash. Note that brute force and Inner Product Hash perform the same. Distribution Sensitive Bucketing makes 20X fewer positive calls than brute force search and 10X fewer than LSH-Hamming.

Fig. 4.

The run time of Distribution Sensitive Bucketing (Algorithm 2) in comparison to brute force, LSH-Hamming, and Inner Product Hash. Distribution Sensitive Bucketing achieves a 20X reduction in time compared to brute force; LSH-Hamming achieves a 2X reduction.

8 Conclusion

In this paper we introduce a problem from computational biology that requires computing the joint likelihood of all pairs of data points coming from two separate large data sets. In order to speed up the brute force procedure, we develop a novel bucketing method. We show theoretically and experimentally that our method is superior to methods from the locality sensitive hashing literature.