Theory of Computing Systems

, 46:104 | Cite as

Maximal Intersection Queries in Randomized Input Models

  • Benjamin Hoffmann
  • Mikhail Lifshits
  • Yury Lifshits
  • Dirk Nowotka
Article

Abstract

Consider a family of sets and a single set, called the query set. How can one quickly find a member of the family which has a maximal intersection with the query set? Time constraints on the query and on a possible preprocessing of the set family make this problem challenging. Such maximal intersection queries arise in a wide range of applications, including web search, recommendation systems, and distributing on-line advertisements. In general, maximal intersection queries are computationally expensive. We investigate two well-motivated distributions over all families of sets and propose an algorithm for each of them. We show that with very high probability an almost optimal solution is found in time which is logarithmic in the size of the family. Moreover, we point out a threshold phenomenon on the probabilities of intersecting sets in each of our two input models which leads to the efficient algorithms mentioned above.

Keywords

Nearest neighbor problem Randomized input models Zipf’s law Maximal intersection problem Algorithms for large data sets 

References

  1. 1.
    Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33(3), 322–373 (2001) CrossRefGoogle Scholar
  2. 2.
    Broder, A.: On the resemblance and containment of documents. In: SEQUENCES ’97: Proceedings of the Compression and Complexity of Sequences 1997, Washington, DC, USA, p. 21. IEEE Comput. Soc., Los Alamitos (1997) Google Scholar
  3. 3.
    Bruck, J., Naor, M.: The hardness of decoding linear codes with preprocessing. IEEE Trans. Inf. Theory 36(2), 381–385 (1990) MATHCrossRefMathSciNetGoogle Scholar
  4. 4.
    Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001) CrossRefGoogle Scholar
  5. 5.
    Clarkson, K.: Nearest-neighbor searching and metric space dimensions. In: Nearest-Neighbor Methods for Learning and Vision: Theory and Practice. MIT Press, Cambridge (2006) Google Scholar
  6. 6.
    Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC ’04: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, New York, NY, USA, pp. 91–100. Assoc. Comput. Mach., New York (2004) CrossRefGoogle Scholar
  7. 7.
    Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. In: SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, pp. 301–312. Assoc. Comput. Mach., New York (2003) CrossRefGoogle Scholar
  8. 8.
    Goyal, N., Lifshits, Y., Schütze, H.: Disorder inequality: a combinatorial approach to nearest neighbor search. In: WSDM, pp. 25–32 (2008) Google Scholar
  9. 9.
    Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces (survey article). ACM Trans. Database Syst. 28(4), 517–580 (2003) CrossRefGoogle Scholar
  10. 10.
    Hoffmann, B., Lifshits, Y., Nowotka, D.: Maximal intersection queries in randomized graph models. In: Diekert, V., Volkov, M.V., Voronkov, A. (eds.) CSR. Lecture Notes in Computer Science, vol. 4649, pp. 227–236. Springer, Berlin (2007) Google Scholar
  11. 11.
    Indyk, P.: Nearest neighbors in high-dimensional spaces. In: Goodman, J.E., O’Rourke, J. (eds.) Handbook of Discrete and Computational Geometry, 2nd edn. CRC Press, Boca Raton (2004). Chap. 39 Google Scholar
  12. 12.
    Karger, D.R., Ruhl, M.: Finding nearest neighbors in growth-restricted metrics. In: STOC ’02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, New York, NY, USA, pp. 741–750. Assoc. Comput. Mach., New York (2002) CrossRefGoogle Scholar
  13. 13.
    Kleinberg, J.M.: Two algorithms for nearest-neighbor search in high dimensions. In: STOC ’97: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, New York, NY, USA, pp. 599–608. Assoc. Comput. Mach., New York (1997) CrossRefGoogle Scholar
  14. 14.
    Krauthgamer, R., Lee, J.R.: Navigating nets: simple algorithms for proximity search. In: SODA ’04: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, pp. 798–807. SIAM, Philadelphia (2004) Google Scholar
  15. 15.
    Lifshits, Y., Nowotka, D.: Estimation of the click volume by large scale regression analysis. In: Diekert, V., Volkov, M.V., Voronkov, A. (eds.) CSR. Lecture Notes in Computer Science, vol. 4649, pp. 216–226. Springer, Berlin (2007) Google Scholar
  16. 16.
    Maaß, M.G., Nowak, J.: A new method for approximate indexing and dictionarylookup with one error. Inf. Process. Lett. 96(5), 185–191 (2005) CrossRefGoogle Scholar
  17. 17.
    Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999) MATHGoogle Scholar
  18. 18.
    Newman, M.E.J.: The structure and function of complex networks. SIAM Rev. 45, 167 (2003) MATHCrossRefMathSciNetGoogle Scholar
  19. 19.
    Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search—The Metric Space Approach, vol. 32. Springer, Berlin (2006) MATHGoogle Scholar
  20. 20.
    Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38(2), 6 (2006) CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Benjamin Hoffmann
    • 1
  • Mikhail Lifshits
    • 2
  • Yury Lifshits
    • 3
  • Dirk Nowotka
    • 1
  1. 1.Universität StuttgartStuttgartGermany
  2. 2.St. Petersburg State UniversitySt. PetersburgRussia
  3. 3.California Institute of TechnologyPasadenaUSA

Personalised recommendations