World Wide Web, Volume 18, Issue 4, pp 913–947

Online refresh strategies for content based feed aggregation

Abstract

With the rapid growth of data sources, services and devices connected to the Internet, web content available online is becoming increasingly diverse and dynamic. To disseminate evolving and temporary information efficiently, many web applications publish their new information as RSS and Atom documents, which are then collected and transformed by RSS aggregators such as Feedly or Yahoo! News. This article addresses the large-scale aggregation of highly dynamic information sources by focusing on the design of optimal refresh strategies for large collections of RSS feed documents. First, we introduce two quality measures specific to RSS aggregation, which reflect the information completeness and average freshness of the result feeds. Then, we propose a best-effort feed refresh strategy that achieves the highest aggregation quality among existing policies that use the same average number of refreshes. This strategy is based on online change estimation models developed after an in-depth analysis of the temporal publication characteristics of a representative collection of real-world RSS feeds. The presented methods have been implemented and tested against synthetic and real-world RSS feed data sets.

Keywords

RSS feed, Refresh strategy, Online change estimation, Content based feed aggregation, Dynamic web content, Data quality

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Roxana Horincar (1)
  • Bernd Amann (1)
  • Thierry Artières (1)

  1. LIP6, University Pierre et Marie Curie, Paris, France
