Journal of Computer Science and Technology, Volume 31, Issue 2, pp 359–380

Content-Based Publish/Subscribe System for Web Syndication

  • Zeinab Hmedeh
  • Harry Kourdounakis
  • Vassilis Christophides
  • Cédric du Mouza
  • Michel Scholl
  • Nicolas Travers
Regular Paper

Abstract

Content syndication has become a popular means of timely delivery of frequently updated information on the Web. Today, web syndication technologies such as RSS or Atom are used in a wide variety of applications, ranging from large-scale news broadcasting to medium-scale information sharing in scientific and professional communities. However, they exhibit serious limitations in dealing with information overload in Web 2.0. There is a vital need for efficient real-time filtering methods across feeds, so that users can effectively follow the information that interests them personally. In this paper, we investigate three indexing techniques for users’ subscriptions, based on inverted lists or on an ordered trie, for exact and partial matching. We present analytical models for memory requirements and matching time, and we conduct a thorough experimental evaluation to exhibit the impact of critical parameters of realistic web syndication workloads.
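To make the idea of indexing subscriptions with inverted lists more concrete, here is a minimal sketch of a count-based inverted index. It is only an illustration and not the implementation evaluated in the paper. The class name `CountingIndex`, the method names, and the `min_ratio` parameter are all hypothetical. The sketch also assumes that a subscription is a set of keywords, that an exact match means every keyword of the subscription occurs in the incoming item, and that a partial match means at least a given fraction of them does; the abstract does not define these semantics.

```python
from collections import defaultdict


class CountingIndex:
    """Inverted lists from term to subscription ids, matched by counting hits.

    Hypothetical illustration: names and matching semantics are assumptions.
    """

    def __init__(self):
        self.postings = defaultdict(list)  # term -> ids of subscriptions containing it
        self.sizes = {}                    # subscription id -> number of distinct terms

    def subscribe(self, sub_id, terms):
        """Register a subscription under a new id as a set of distinct terms."""
        terms = set(terms)
        if not terms:
            raise ValueError("a subscription needs at least one term")
        self.sizes[sub_id] = len(terms)
        for term in terms:
            self.postings[term].append(sub_id)

    def match(self, item_terms, min_ratio=1.0):
        """Return sorted ids of subscriptions with at least min_ratio of their terms in the item.

        min_ratio=1.0 is the assumed exact match; lower values give a partial match.
        """
        hits = defaultdict(int)
        # Visit only the posting lists of terms that occur in the item.
        for term in set(item_terms):
            for sub_id in self.postings.get(term, ()):
                hits[sub_id] += 1
        return sorted(s for s, n in hits.items() if n >= min_ratio * self.sizes[s])


index = CountingIndex()
index.subscribe("s1", ["rss", "filtering"])
index.subscribe("s2", ["atom", "news", "sports"])
item = "real-time filtering of rss and atom news feeds".split()
exact = index.match(item)          # only s1 has all of its terms in the item
partial = index.match(item, 0.5)   # s2 also qualifies with 2 of its 3 terms
```

The counting approach touches only the posting lists of terms present in the item, so its cost depends on the length of those lists rather than on the total number of subscriptions. A trie-based index, the other family named in the abstract, would instead share common term prefixes among subscriptions.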

Keywords

pub/sub, subscription indexing, web syndication, partial matching, scalability

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Zeinab Hmedeh (1)
  • Harry Kourdounakis (2)
  • Vassilis Christophides (2)
  • Cédric du Mouza (1)
  • Michel Scholl (1)
  • Nicolas Travers (1)

  1. CEDRIC Laboratory, Conservatoire National des Arts et Métiers, Paris, France
  2. FORTH/ICS, University of Crete, Heraklion, Greece