Evaluating continuous top-k queries over document streams

Abstract

In the age of Web 2.0, Web content has become live, and users want to receive content of interest automatically. The popular RSS subscription approach cannot offer fine-grained filtering. In this paper, we propose a personalized subscription approach over live Web content. Each document is represented by pairs of terms and weights, and each user defines a continuous top-k query. Based on an aggregation function that measures the relevance between a document and a query, the user continuously receives the top-k most relevant documents inside a sliding window. The challenge of this subscription approach is its high processing cost, especially when the number of queries is very large. Our basic idea is to share evaluation results among queries. Based on a defined covering relationship between queries, we identify the relations among the aggregation scores of such queries and develop a graph indexing structure (GIS) to maintain the queries. Next, based on the GIS, we propose a document evaluation algorithm that shares evaluation results among queries. We then reuse previously evaluated (history) documents and design a document indexing structure (DIS) to maintain them. Finally, we adopt a cost-model-based approach to unify the use of the GIS and the DIS. The experimental results show that our solution outperforms previous work that uses the classic inverted list structure.
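To make the query model concrete, the following is a minimal sketch of the naive, unshared baseline that the abstract sets out to improve on: one query evaluated against every incoming document. It is not the GIS or DIS approach. Several details are assumptions made only for illustration: the aggregation function is taken to be a weighted sum over shared terms, the sliding window is count-based, and the names `score` and `ContinuousTopK` are hypothetical.

```python
from collections import deque
import heapq


def score(doc, query):
    """Assumed aggregation function: weighted sum over terms shared by doc and query."""
    return sum(q_weight * doc[term] for term, q_weight in query.items() if term in doc)


class ContinuousTopK:
    """Naive continuous top-k query over a count-based sliding window (illustrative only)."""

    def __init__(self, query, k, window_size):
        self.query = query            # dict: term -> weight
        self.k = k
        # deque with maxlen drops the oldest document once the window is full
        self.window = deque(maxlen=window_size)

    def push(self, doc_id, doc):
        """Score the new document, slide the window, and return the current top-k ids."""
        self.window.append((doc_id, score(doc, self.query)))
        return self.top_k()

    def top_k(self):
        # Rescan the whole window; documents with zero relevance are not reported.
        best = heapq.nlargest(self.k, self.window, key=lambda pair: pair[1])
        return [doc_id for doc_id, s in best if s > 0]


if __name__ == "__main__":
    q = ContinuousTopK({"stream": 1.0, "query": 0.5}, k=2, window_size=3)
    print(q.push("d1", {"stream": 0.8}))
    print(q.push("d2", {"query": 0.6, "web": 0.4}))
    print(q.push("d3", {"stream": 0.5, "query": 0.5}))
    print(q.push("d4", {"web": 1.0}))
```

In this sketch every query rescans its own window for every document, so the work grows with the number of queries. That per-query cost is the overhead the abstract aims to reduce by sharing evaluation results among queries.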

Author information

Corresponding author

Correspondence to Weixiong Rao.

Additional information

Part of this work was done while the first author, currently affiliated with the Department of Computer Science, University of Helsinki, Finland, was visiting the China R&D Center for Internet of Things, Wuxi, China.

About this article

Cite this article

Rao, W., Chen, L., Chen, S. et al. Evaluating continuous top-k queries over document streams. World Wide Web 17, 59–83 (2014). https://doi.org/10.1007/s11280-012-0191-3

  • DOI: https://doi.org/10.1007/s11280-012-0191-3

Keywords

  • top-k query
  • information filtering
  • web document streams