Abstract
Software systems such as Web crawlers, Web archives, and Web caches depend on, or can be improved by, knowledge of the update times of remote sources. Assuming an exponential distribution of the time intervals between updates, the literature has presented diverse statistical methods for finding optimal reload times of remote sources. In this article, we first present the observation that the time behavior of a fraction of Web data may be described more precisely by regular or quasi-regular grammars. Second, we present an approach to estimate the parameters of such grammars automatically. By comparing a reload policy based on regular approximation with previous exponential-distribution-based methods, we show that the quality of local copies of remote sources, in terms of ’freshness’ and the amount of lost data, may be improved significantly.
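To make the contrast between the two modeling assumptions concrete, the following is a minimal sketch and not the method of the paper. It assumes update timestamps are available as a sorted list of numbers. The function names, the median-gap predictor, and the `tolerance` threshold are hypothetical choices made for illustration. The regularity check is a simple spread heuristic standing in for grammar estimation, and the exponential rate is the standard maximum-likelihood estimate (one over the mean gap).

```python
from statistics import mean, median, pstdev


def gaps_between(update_times):
    """Return the gaps between consecutive update timestamps."""
    return [b - a for a, b in zip(update_times, update_times[1:])]


def looks_regular(update_times, tolerance=0.1):
    """Heuristic (an assumption of this sketch): treat the source as
    quasi regular if the spread of the gaps is small relative to their mean."""
    gaps = gaps_between(update_times)
    return pstdev(gaps) <= tolerance * mean(gaps)


def next_reload_regular(update_times):
    """Predict the next update as the last observed update plus the median gap."""
    return update_times[-1] + median(gaps_between(update_times))


def exponential_rate(update_times):
    """Maximum-likelihood rate for exponentially distributed gaps: 1 / mean gap."""
    return 1.0 / mean(gaps_between(update_times))


# Hypothetical observations, in minutes.
hourly = [0, 60, 120, 181, 240]   # nearly periodic source
bursty = [0, 5, 100, 110, 300]    # irregular source

for name, times in (("hourly", hourly), ("bursty", bursty)):
    if looks_regular(times):
        print(name, "regular, reload at", next_reload_regular(times))
    else:
        print(name, "irregular, estimated rate", exponential_rate(times))
```

For a nearly periodic source, a reload scheduled just after the predicted update time can capture each change once. A rate-based policy only knows the average frequency and cannot target individual updates.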
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Kukulenz, D. (2004). Capturing Web Dynamics by Regular Approximation. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_55
DOI: https://doi.org/10.1007/978-3-540-30480-7_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23894-2
Online ISBN: 978-3-540-30480-7
eBook Packages: Springer Book Archive