Abstract
A web robot is a program that automatically downloads and stores web pages. Implementation issues of web robots have been studied widely, and various web statistics have been reported in the literature. This paper first describes the overall architecture of our robot and the implementation decisions behind several important issues. Second, we present empirical statistics on approximately 73 million Korean web pages. We also identify which factors of web pages affect page changes; these factors may be used to select which web pages to update incrementally.
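The core loop the abstract describes (download pages, store them, follow discovered links) can be sketched as a breadth-first crawler. The sketch below is illustrative only: the names, the injected `fetch` callable, and the in-memory store are assumptions for exposition, not the authors' actual implementation.

```python
# Minimal breadth-first web-robot sketch: fetch pages from seed URLs,
# store each page, and queue newly discovered links. Illustrative only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns the page's HTML or None.
    Returns a dict mapping each visited URL to its stored content."""
    queue = deque(seeds)
    seen = set(seeds)
    store = {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html              # "downloads and stores web pages"
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:      # enqueue links not yet seen
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store
```

A production robot would add, at minimum, robots.txt handling, per-host politeness delays, URL normalization, and persistent storage; the paper's architecture addresses such issues at the scale of tens of millions of pages.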
This work was supported by grant number R-01-2000-00403 from the Korea Science and Engineering Foundation.
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Kim, S.J., Lee, S.H. (2003). Implementation of a Web Robot and Statistics on the Korean Web. In: Chung, CW., Kim, CK., Kim, W., Ling, TW., Song, KH. (eds) Web and Communication Technologies and Internet-Related Social Issues — HSI 2003. Lecture Notes in Computer Science, vol 2713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45036-X_35
DOI: https://doi.org/10.1007/3-540-45036-X_35
Print ISBN: 978-3-540-40456-9
Online ISBN: 978-3-540-45036-8