Continuous similarity search for evolving queries

  • Regular Paper
  • Knowledge and Information Systems

Abstract

In this paper, we study a novel problem of continuous similarity search for evolving queries. Given a set of objects, each being a set or multiset of items, and a data stream, we want to continuously maintain the top-k most similar objects using the last n items in the stream as an evolving query. We show that the problem has several important applications. At the same time, the problem is challenging. We develop a filtering-based method and a hashing-based method. Our experimental results on both real data sets and synthetic data sets show that our methods are effective and efficient.

Notes

  1. http://www.cs.loyola.edu/~cgiannel/assoc_gen.html.

  2. http://fimi.ua.ac.be/data/.

  3. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.

Author information

Corresponding author

Correspondence to Jian Pei.

Additional information

This work is partly supported by an NSERC Discovery grant, the Canada Research Chair program, and a Yahoo! Faculty Research and Engagement Program (FREP) award. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Appendix: Other similarity measures and their upper bounds for pruning method

In this appendix, we extend the upper bounds used by the pruning method to weighted overlap, weighted cosine, and weighted Dice similarity. As in the case of weighted Jaccard similarity, each bound is monotonic in the number of updates u: it never decreases as u grows.

Property 1

(A progressive upper bound for weighted overlap similarity). We first define the weighted overlap similarity as follows.

$$\begin{aligned} sim_{over}(X, Y) = sim_{over}(\vec {X}, \vec {Y}) = \sum _{i=1}^{|\Psi |} \min (x_i, y_i) \end{aligned}$$
(8)

Let X and Y be two multisets, and let \(Y'\) be the multiset obtained from Y after u updates. Given |X|, |Y|, and the weighted overlap similarity score \(sim_{over}(X, Y)\) between X and Y, and without knowing which elements were updated in \(Y'\), we have the following upper bound on \(sim_{over}(X, Y')\).

$$\begin{aligned} sim_{over}(X, Y') \le sim_{over}(X, Y) + u \end{aligned}$$
(9)

Proof

By definition,

$$\begin{aligned} sim_{over}(X, Y) = \sum _{i=1}^{|\Psi |} \min (x_i, y_i) \end{aligned}$$

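In our scenario, \(|Y| = |Y'|\), so each update replaces one item of Y with a new item. Removing an item decreases some \(y_i\) by 1 and therefore cannot increase any term \(\min (x_i, y_i)\). Inserting an item increases some \(y_i\) by 1 and therefore increases the corresponding term by at most 1. Hence each update increases the sum by at most 1, and the maximum possible increase after u updates is u. \(\square \)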
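
The bound in Eq. (9) can be checked empirically. The following is a minimal Python sketch, not code from the paper. It assumes that multisets are represented as `collections.Counter` objects and that each update replaces one randomly chosen item of Y with a new item, so that |Y'| = |Y|. The names `sim_over`, `apply_updates`, and `check_overlap_bound` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def sim_over(x: Counter, y: Counter) -> int:
    """Weighted overlap similarity (Eq. 8): sum of min(x_i, y_i) over all items."""
    return sum(min(count, y[item]) for item, count in x.items())


def apply_updates(y: Counter, new_items, rng: random.Random) -> Counter:
    """Assumed update model: each update removes one item of Y and inserts one new item."""
    updated = Counter(y)
    for item in new_items:
        victim = rng.choice(list(updated.elements()))
        updated[victim] -= 1
        if updated[victim] == 0:
            del updated[victim]
        updated[item] += 1
    return updated


def check_overlap_bound(trials: int = 1000, seed: int = 0) -> bool:
    """Randomly test sim_over(X, Y') <= sim_over(X, Y) + u (Eq. 9)."""
    rng = random.Random(seed)
    alphabet = list("abcdef")
    for _ in range(trials):
        x = Counter(rng.choices(alphabet, k=8))
        y = Counter(rng.choices(alphabet, k=8))
        u = rng.randint(0, 8)
        y_new = apply_updates(y, rng.choices(alphabet, k=u), rng)
        assert sum(y_new.values()) == sum(y.values())  # |Y'| = |Y|
        assert sim_over(x, y_new) <= sim_over(x, y) + u
    return True
```

A randomized check of this kind does not replace the proof; it only guards against mistakes when the bound is implemented.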

Property 2

(A progressive upper bound for weighted cosine similarity) We first define the weighted cosine similarity as follows.

$$\begin{aligned} sim_{cos}(X, Y) = sim_{cos}(\vec {X}, \vec {Y}) = \frac{\sum _{i=1}^{|\Psi |} \min (x_i, y_i)}{\sqrt{|X||Y|}} \end{aligned}$$
(10)

Let X and Y be two multisets, and let \(Y'\) be the multiset obtained from Y after u updates. Given |X|, |Y|, and the weighted cosine similarity score \(sim_{cos}(X, Y)\) between X and Y, and without knowing which elements were updated in \(Y'\), we have the following upper bound on \(sim_{cos}(X, Y')\).

$$\begin{aligned} sim_{cos}(X, Y') \le sim_{cos}(X, Y) + \frac{u}{\sqrt{|X||Y|}} \end{aligned}$$
(11)

Proof

In our scenario, \(|Y| = |Y'|\). Thus, \(\sqrt{|X||Y|} = \sqrt{|X||Y'|}\). By Property 1, we have

$$\begin{aligned} sim_{cos}(X, Y') \le \frac{\sum _{i=1}^{|\Psi |} \min (x_i, y_i) + u}{\sqrt{|X||Y'|}} = sim_{cos}(X, Y) + \frac{u}{\sqrt{|X||Y|}} \end{aligned}$$

\(\square \)
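
As a worked example of Eq. (11), the following Python sketch evaluates the bound on a small hand-picked instance. It is an illustration under assumptions, not the authors' implementation: multisets are modelled as `collections.Counter` objects, and the function names `sim_cos` and `cos_upper_bound` are hypothetical.

```python
from collections import Counter
from math import sqrt


def sim_cos(x: Counter, y: Counter) -> float:
    """Weighted cosine similarity as defined in Eq. (10)."""
    overlap = sum(min(count, y[item]) for item, count in x.items())
    return overlap / sqrt(sum(x.values()) * sum(y.values()))


def cos_upper_bound(score: float, size_x: int, size_y: int, u: int) -> float:
    """Upper bound of Eq. (11) on the score after u updates on Y."""
    return score + u / sqrt(size_x * size_y)


x = Counter("aabc")      # X = {a, a, b, c}, |X| = 4
y = Counter("abdd")      # Y = {a, b, d, d}, |Y| = 4
y_new = Counter("aabd")  # Y' after u = 1 update: one d replaced by a

old_score = sim_cos(x, y)                      # overlap 2, so 2 / 4
bound = cos_upper_bound(old_score, 4, 4, u=1)  # 2 / 4 + 1 / 4
new_score = sim_cos(x, y_new)                  # overlap 3, so 3 / 4
assert new_score <= bound
```

In this instance the update inserts an item that X still needs, so the new score meets the bound with equality, which shows that the bound cannot be tightened without knowing the updated elements.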

Property 3

(A progressive upper bound for weighted Dice similarity) We first define the weighted Dice similarity as follows.

$$\begin{aligned} sim_{dice}(X, Y) = sim_{dice}(\vec {X}, \vec {Y}) = \frac{2\sum _{i=1}^{|\Psi |} \min (x_i, y_i)}{|X| + |Y|} \end{aligned}$$
(12)

Let X and Y be two multisets, and let \(Y'\) be the multiset obtained from Y after u updates. Given |X|, |Y|, and the weighted Dice similarity score \(sim_{dice}(X, Y)\) between X and Y, and without knowing which elements were updated in \(Y'\), we have the following upper bound on \(sim_{dice}(X, Y')\).

$$\begin{aligned} sim_{dice}(X, Y') \le sim_{dice}(X, Y) + \frac{2u}{|X| + |Y|} \end{aligned}$$
(13)

Proof

In our scenario, \(|Y| = |Y'|\). Thus, \(|X|+|Y| = |X|+|Y'|\). By Property 1, we have

$$\begin{aligned} sim_{dice}(X, Y') \le \frac{2\sum _{i=1}^{|\Psi |} \min (x_i, y_i) + 2u}{|X|+|Y'|} = sim_{dice}(X, Y) + \frac{2u}{|X|+|Y|} \end{aligned}$$

\(\square \)
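
One way such a bound could support pruning is sketched below in Python. This is a simplified illustration, not the filtering algorithm of the paper: it assumes that each object X has a stale score computed against the query as it was u updates ago, and it searches only for the single best object (k = 1). The names `sim_dice`, `dice_upper_bound`, and `best_match` are hypothetical.

```python
from collections import Counter


def sim_dice(x: Counter, y: Counter) -> float:
    """Weighted Dice similarity as defined in Eq. (12)."""
    overlap = sum(min(count, y[item]) for item, count in x.items())
    return 2 * overlap / (sum(x.values()) + sum(y.values()))


def dice_upper_bound(score: float, size_x: int, size_y: int, u: int) -> float:
    """Upper bound of Eq. (13) on the score after u updates on Y."""
    return score + 2 * u / (size_x + size_y)


def best_match(objects: dict, stale_scores: dict, query: Counter, u: int):
    """Return (name, score, number of exact evaluations) for the best object.

    stale_scores[name] is the Dice score of objects[name] against the query
    as it was u updates ago; |query| is assumed unchanged since then.
    """
    size_q = sum(query.values())
    best_name, best_score, evaluated = None, -1.0, 0
    # Visit objects with the highest stale score first.
    for name in sorted(objects, key=lambda n: stale_scores[n], reverse=True):
        x = objects[name]
        bound = dice_upper_bound(stale_scores[name], sum(x.values()), size_q, u)
        if bound <= best_score:
            continue  # this object cannot beat the current best: skip it
        evaluated += 1
        score = sim_dice(x, query)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, evaluated


objects = {"A": Counter("aabb"), "B": Counter("ccdd"), "C": Counter("abee")}
old_query = Counter("aabc")
new_query = Counter("aabb")  # one update: c replaced by b
stale = {name: sim_dice(x, old_query) for name, x in objects.items()}
result = best_match(objects, stale, new_query, u=1)
```

On this toy input only object A is evaluated exactly. The bounds for B and C fall below A's new score, so both are skipped without recomputing their similarity.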

About this article

Cite this article

Xu, X., Gao, C., Pei, J. et al. Continuous similarity search for evolving queries. Knowl Inf Syst 48, 649–678 (2016). https://doi.org/10.1007/s10115-015-0892-x

  • DOI: https://doi.org/10.1007/s10115-015-0892-x
