Continuous similarity search for evolving queries

Xu, Xiaoning; Gao, Chuancong; Pei, Jian; Wang, Ke; Al-Barakati, Abdullah

doi:10.1007/s10115-015-0892-x

Regular Paper
Published: 15 October 2015

Volume 48, pages 649–678, (2016)

Knowledge and Information Systems

Xiaoning Xu¹,
Chuancong Gao²,
Jian Pei²,³,
Ke Wang² &
…
Abdullah Al-Barakati³

Abstract

In this paper, we study a novel problem of continuous similarity search for evolving queries. Given a set of objects, each being a set or multiset of items, and a data stream, we want to continuously maintain the top-k most similar objects using the last n items in the stream as an evolving query. We show that the problem has several important applications. At the same time, the problem is challenging. We develop a filtering-based method and a hashing-based method. Our experimental results on both real data sets and synthetic data sets show that our methods are effective and efficient.
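To make the problem statement concrete, the following is a minimal brute-force baseline in Python, not one of the paper's two methods. It rescores every object after each stream arrival. It assumes the standard multiset form of weighted Jaccard similarity (sum of minimum counts over sum of maximum counts); the names `weighted_jaccard` and `continuous_top_k` and the toy data are hypothetical.

```python
import heapq
from collections import Counter, deque


def weighted_jaccard(x: Counter, y: Counter) -> float:
    """Assumed standard multiset Jaccard: sum of min counts / sum of max counts."""
    inter = sum((x & y).values())
    union = sum((x | y).values())
    return inter / union if union else 0.0


def continuous_top_k(objects, stream, n, k):
    """Yield the ids of the k most similar objects after each arrival,
    using the last n stream items as the query (starts once the window is full)."""
    window = deque()
    query = Counter()
    for item in stream:
        window.append(item)
        query[item] += 1
        if len(window) > n:
            old = window.popleft()  # oldest item leaves the query
            query[old] -= 1
            if query[old] == 0:
                del query[old]
        if len(window) == n:
            yield heapq.nlargest(
                k, objects, key=lambda oid: weighted_jaccard(objects[oid], query)
            )


objects = {"o1": Counter("aab"), "o2": Counter("ccd")}
results = list(continuous_top_k(objects, "aabccd", n=3, k=1))
```

Every arrival costs one similarity computation per object in this baseline, which is the cost that filtering or hashing would aim to avoid.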

Author information

Authors and Affiliations

Fortinet Inc., Burnaby, BC, Canada
Xiaoning Xu
Simon Fraser University, Burnaby, BC, Canada
Chuancong Gao, Jian Pei & Ke Wang
King Abdulaziz University, Jeddah, Saudi Arabia
Jian Pei & Abdullah Al-Barakati

Corresponding author

Correspondence to Jian Pei.

Additional information

This work is partly supported by an NSERC Discovery grant, the Canada Research Chair program, and a Yahoo! Faculty Research and Engagement Program (FREP) award. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Appendix: Other similarity measures and their upper bounds for pruning method

In this section, we extend the upper bounds used by the pruning method to the weighted overlap, weighted cosine, and weighted Dice similarities. As in the case of weighted Jaccard similarity, each of these bounds is monotone in the number of updates $u$.

Property 1

(A progressive upper bound for weighted overlap similarity). We first define the weighted overlap similarity as follows.

$$\begin{aligned} sim_{over}(X, Y) = sim_{over}(\vec {X}, \vec {Y}) = \sum _{i=1}^{|\Psi |} \min (x_i, y_i) \end{aligned}$$

(8)

Here $x_i$ and $y_i$ denote the multiplicities of the $i$-th item of the item universe $\Psi$ in $X$ and $Y$, respectively. Let $X$ and $Y$ be two multisets, and let $Y'$ be the multiset obtained from $Y$ by $u$ updates. Given $|X|$, $|Y|$, and the weighted overlap similarity score $sim_{over}(X, Y)$, we can bound $sim_{over}(X, Y')$ from above without knowing which elements of $Y'$ were updated:

$$\begin{aligned} sim_{over}(X, Y') \le sim_{over}(X, Y) + u \end{aligned}$$

(9)

Proof

By definition,

$$\begin{aligned} sim_{over}(X, Y) = \sum _{i=1}^{|\Psi |} \min (x_i, y_i) \end{aligned}$$

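Each update raises the multiplicity of at most one item by 1 (in our scenario, one item enters $Y$ while another leaves), so it raises at most one term $\min (x_i, y_i)$ by at most 1, and a lowered multiplicity cannot increase any term. Hence, the maximum possible increase after $u$ updates is $u$. $\square $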
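As an informal check of Property 1, the following minimal Python sketch computes the weighted overlap with `collections.Counter` and tests the bound on random updates. It assumes that an update replaces one item of $Y$ with another item, so that $|Y'| = |Y|$; the names `overlap` and `apply_updates` and the toy alphabet are illustrative and not taken from the paper.

```python
import random
from collections import Counter


def overlap(x: Counter, y: Counter) -> int:
    """Weighted overlap: sum over all items of min(x_i, y_i)."""
    return sum((x & y).values())


def apply_updates(y: Counter, u: int, alphabet: str, rng: random.Random) -> Counter:
    """Return a copy of y after u updates; each update swaps one item (assumed model)."""
    y2 = Counter(y)
    for _ in range(u):
        old = rng.choice(list(y2.elements()))  # item leaving the multiset
        y2[old] -= 1
        if y2[old] == 0:
            del y2[old]
        y2[rng.choice(alphabet)] += 1  # item entering the multiset
    return y2


rng = random.Random(0)
alphabet = "abcdef"
X = Counter("aabbbcd")
Y = Counter("abccdde")

for u in range(1, 6):
    for _ in range(200):
        Y2 = apply_updates(Y, u, alphabet, rng)
        assert sum(Y2.values()) == sum(Y.values())  # |Y'| = |Y|
        assert overlap(X, Y2) <= overlap(X, Y) + u  # bound of Property 1
```

A randomized check like this is only a sanity test; the proof above is what establishes the bound.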

Property 2

(A progressive upper bound for weighted cosine similarity). We first define the weighted cosine similarity as follows.

$$\begin{aligned} sim_{cos}(X, Y) = sim_{cos}(\vec {X}, \vec {Y}) = \frac{\sum _{i=1}^{|\Psi |} \min (x_i, y_i)}{\sqrt{|X||Y|}} \end{aligned}$$

(10)

Let $X$ and $Y$ be two multisets, and let $Y'$ be the multiset obtained from $Y$ by $u$ updates. Given $|X|$, $|Y|$, and the weighted cosine similarity score $sim_{cos}(X, Y)$, we can bound $sim_{cos}(X, Y')$ from above without knowing which elements of $Y'$ were updated:

$$\begin{aligned} sim_{cos}(X, Y') \le sim_{cos}(X, Y) + \frac{u}{\sqrt{|X||Y|}} \end{aligned}$$

(11)

Proof

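In our scenario, $|Y| = |Y'|$, and thus $\sqrt{|X||Y|} = \sqrt{|X||Y'|}$. The numerator of $sim_{cos}(X, Y')$ is the weighted overlap of $X$ and $Y'$, which by Property 1 is at most $\sum _{i=1}^{|\Psi |} \min (x_i, y_i) + u$. Therefore, we have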

$$\begin{aligned} sim_{cos}(X, Y') \le \frac{\sum _{i=1}^{|\Psi |} \min (x_i, y_i) + u}{\sqrt{|X||Y'|}} = sim_{cos}(X, Y) + \frac{u}{\sqrt{|X||Y|}} \end{aligned}$$

$\square $
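The bound in Eq. (11) needs only the stale score, the two sizes, and $u$. A minimal Python sketch under the same assumptions as before (an update replaces one item of $Y$; the function names and toy multisets are hypothetical):

```python
import math
from collections import Counter


def sim_cos(x: Counter, y: Counter) -> float:
    """Weighted cosine similarity as defined in Eq. (10)."""
    shared = sum((x & y).values())  # sum of min(x_i, y_i)
    return shared / math.sqrt(sum(x.values()) * sum(y.values()))


def cos_upper_bound(score: float, size_x: int, size_y: int, u: int) -> float:
    """Right-hand side of Eq. (11), computed from the stale score alone."""
    return score + u / math.sqrt(size_x * size_y)


X = Counter("aabbbcd")
Y = Counter("abccdde")
Y1 = Counter("abbcdde")  # Y after one update: one 'c' replaced by one 'b'

bound = cos_upper_bound(sim_cos(X, Y), sum(X.values()), sum(Y.values()), u=1)
# The small tolerance absorbs floating-point rounding in the comparison.
assert sim_cos(X, Y1) <= bound + 1e-12
```

In this toy case the update adds an item that $X$ still has room for, so the updated score meets the bound exactly (up to rounding).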

Property 3

(A progressive upper bound for weighted Dice similarity). We first define the weighted Dice similarity as follows.

$$\begin{aligned} sim_{dice}(X, Y) = sim_{dice}(\vec {X}, \vec {Y}) = \frac{2\sum _{i=1}^{|\Psi |} \min (x_i, y_i)}{|X| + |Y|} \end{aligned}$$

(12)

Let $X$ and $Y$ be two multisets, and let $Y'$ be the multiset obtained from $Y$ by $u$ updates. Given $|X|$, $|Y|$, and the weighted Dice similarity score $sim_{dice}(X, Y)$, we can bound $sim_{dice}(X, Y')$ from above without knowing which elements of $Y'$ were updated:

$$\begin{aligned} sim_{dice}(X, Y') \le sim_{dice}(X, Y) + \frac{2u}{|X| + |Y|} \end{aligned}$$

(13)

Proof

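In our scenario, $|Y| = |Y'|$, and thus $|X|+|Y| = |X|+|Y'|$. The numerator of $sim_{dice}(X, Y')$ is twice the weighted overlap of $X$ and $Y'$, which by Property 1 is at most $2\sum _{i=1}^{|\Psi |} \min (x_i, y_i) + 2u$. Therefore, we have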

$$\begin{aligned} sim_{dice}(X, Y') \le \frac{2\sum _{i=1}^{|\Psi |} \min (x_i, y_i) + 2u}{|X|+|Y'|} = sim_{dice}(X, Y) + \frac{2u}{|X|+|Y|} \end{aligned}$$

$\square $
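One way such a bound could be used for pruning is sketched below in Python. This is an illustration under assumptions, not the paper's algorithm: scores computed $u$ updates ago are kept, and an object is rescored only if its upper bound from Eq. (13) can still reach a given threshold (for example, the current $k$-th best score). The names `dice_upper_bound` and `candidates_to_rescore`, the threshold value, and the toy objects are hypothetical. `Fraction` keeps the arithmetic exact.

```python
from collections import Counter
from fractions import Fraction


def sim_dice(x: Counter, y: Counter) -> Fraction:
    """Weighted Dice similarity as defined in Eq. (12), kept exact."""
    return Fraction(2 * sum((x & y).values()), sum(x.values()) + sum(y.values()))


def dice_upper_bound(score: Fraction, size_x: int, size_y: int, u: int) -> Fraction:
    """Right-hand side of Eq. (13), computed from the stale score alone."""
    return score + Fraction(2 * u, size_x + size_y)


def candidates_to_rescore(stale, size_y, u, threshold):
    """Hypothetical filter: keep objects whose bound can still reach `threshold`.

    `stale` maps object id -> (|X|, sim_dice(X, Y) computed u updates ago).
    """
    return [
        oid
        for oid, (size_x, score) in stale.items()
        if dice_upper_bound(score, size_x, size_y, u) >= threshold
    ]


objects = {"o1": Counter("aabbbcd"), "o2": Counter("eeffgg"), "o3": Counter("abcde")}
Y = Counter("abccdde")
stale = {oid: (sum(x.values()), sim_dice(x, Y)) for oid, x in objects.items()}
survivors = candidates_to_rescore(stale, sum(Y.values()), u=1, threshold=Fraction(1, 2))
```

In this toy run, `o2` has a stale score of 2/13 and a bound of 4/13, which is below the threshold of 1/2, so it is skipped without being rescored; `o1` and `o3` remain candidates.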

About this article

Cite this article

Xu, X., Gao, C., Pei, J. et al. Continuous similarity search for evolving queries. Knowl Inf Syst 48, 649–678 (2016). https://doi.org/10.1007/s10115-015-0892-x

Received: 14 May 2015
Revised: 03 August 2015
Accepted: 05 October 2015
Published: 15 October 2015
Issue Date: September 2016
DOI: https://doi.org/10.1007/s10115-015-0892-x
