Distributed and Parallel Databases, Volume 26, Issue 1, pp 3–27

Distributed top-k aggregation queries at large

  • Thomas Neumann
  • Matthias Bender
  • Sebastian Michel
  • Ralf Schenkel
  • Peter Triantafillou
  • Gerhard Weikum
Open Access
Article

Abstract

Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially in distributed settings, where the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. They address three degrees of freedom: (1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, (2) computing data-adaptive scan depths for different input sources, and (3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments on three different real-life datasets, using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
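
The role of histogram convolution in such a cost model can be illustrated with a minimal sketch. It is not the estimator used in the paper: it assumes that each input list is summarized by a histogram over discretized score levels 0, 1, 2, ..., that scores in different lists are independent, and that the aggregation function is a sum. The function names `convolve_histograms` and `tail_probability` are hypothetical.

```python
def convolve_histograms(h1, h2):
    """Distribution of the sum of two independent discretized scores.

    h1[i] is the probability that a score from list 1 falls on level i,
    h2[j] likewise for list 2 (assumption: independent lists).
    The result has len(h1) + len(h2) - 1 levels.
    """
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            # Levels i and j add up to aggregated level i + j.
            out[i + j] += p * q
    return out


def tail_probability(hist, level):
    """Probability that the aggregated score reaches at least `level`."""
    return sum(hist[level:])


# Illustrative synopses for two input lists (made-up numbers).
list_a = [0.5, 0.3, 0.2]
list_b = [0.6, 0.4]

aggregated = convolve_histograms(list_a, list_b)
p_high = tail_probability(aggregated, 2)
```

Multiplying a tail probability such as `p_high` by the number of distinct items gives a rough estimate of how many items can be expected above a candidate score threshold, which is the kind of quantity a cost model needs when choosing scan depths.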

Keywords

Top-k, Distributed queries, Query optimization, Cost models

Copyright information

© The Author(s) 2009

Authors and Affiliations

  • Thomas Neumann (1)
  • Matthias Bender (1)
  • Sebastian Michel (2)
  • Ralf Schenkel (3)
  • Peter Triantafillou (4)
  • Gerhard Weikum (1)

  1. Max-Planck-Institut für Informatik, Saarbrücken, Germany
  2. École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  3. Saarland University and Max-Planck-Institut für Informatik, Saarbrücken, Germany
  4. University of Patras, Patras, Greece
