Efficient processing of exact top-k queries over disk-resident sorted lists

Pang, HweeHwa; Ding, Xuhua; Zheng, Baihua

doi:10.1007/s00778-009-0174-x

Regular Paper
Published: 08 December 2009

Volume 19, pages 437–456, (2010)
The VLDB Journal

HweeHwa Pang¹,
Xuhua Ding¹ &
Baihua Zheng¹

Abstract

The top-k query is employed in a wide range of applications to generate a ranked list of data that have the highest aggregate scores over certain attributes. As the pool of attributes for selection by individual queries may be large, the data are indexed with per-attribute sorted lists, and a threshold algorithm (TA) is applied on the lists involved in each query. The TA executes in two phases: it first finds a cut-off threshold for the top-k result scores, then evaluates all the records that could score above the threshold. In this paper, we focus on exact top-k queries that involve monotonic linear scoring functions over disk-resident sorted lists. We introduce a model for estimating the depths to which each sorted list needs to be processed in the two phases, so that (most of) the required records can be fetched efficiently through sequential or batched I/Os. We also devise a mechanism to quickly rank the data that qualify for the query answer and to eliminate those that do not, in order to reduce the computation demand of the query processor. Extensive experiments with four different datasets confirm that our schemes achieve a speed-up of between two times and two orders of magnitude over existing TAs, at the expense of a memory overhead of 4.8 bits per attribute value. Moreover, our schemes are robust to different data distributions and query characteristics.

Author information

Authors and Affiliations

School of Information Systems, Singapore Management University, Singapore, Singapore
HweeHwa Pang, Xuhua Ding & Baihua Zheng

Corresponding author

Correspondence to HweeHwa Pang.

About this article

Cite this article

Pang, H., Ding, X. & Zheng, B. Efficient processing of exact top-k queries over disk-resident sorted lists. The VLDB Journal 19, 437–456 (2010). https://doi.org/10.1007/s00778-009-0174-x

Received: 19 March 2009
Revised: 28 September 2009
Accepted: 10 November 2009
Published: 08 December 2009
Issue Date: June 2010
DOI: https://doi.org/10.1007/s00778-009-0174-x
