
InfoMall: A Large-Scale Storage System for Web Archiving

  • Lian’en Huang
  • Jinping Li
  • Xiaoming Li
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7901)

Abstract

The World Wide Web is a fluid medium: Web pages and entire Web sites frequently change or disappear, often without leaving any trace. Given the great value of the Web, it is necessary to archive the current Web for the future, and doing so requires a large-scale storage system. In this paper we propose such a system, designed to store the massive collection of Web pages we have been gathering consistently since 2001. A significant feature of this collection is that it has space-time dimensions: every Web page is associated with a URL and a crawl time, and a single URL may correspond to many Web pages crawled at different times. Our system is designed so that sorted Web pages are clustered and stored together at some degree of space-time granularity. As a result, users can effectively retrieve individual Web pages by specifying a URL and a time, or batches of Web pages by specifying URL ranges and time ranges.
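
The space-time keying described above can be illustrated with a small sketch. This is not the InfoMall implementation, whose storage layout the abstract does not detail; it is a toy in-memory model under stated assumptions. The class name SpaceTimeStore, the method names put, get and scan, and the use of integer timestamps are all hypothetical. Pages are kept sorted by (URL, crawl time), so a point lookup and a URL-range plus time-range scan both start from a binary search.

```python
import bisect
from typing import List, Optional, Tuple


class SpaceTimeStore:
    """Toy in-memory model (hypothetical, not the paper's system).

    Keys are (url, crawl_time) tuples kept in sorted order, so snapshots
    of the same URL sit next to each other, ordered by time.
    """

    def __init__(self) -> None:
        self._keys: List[Tuple[str, int]] = []
        self._pages: List[bytes] = []

    def put(self, url: str, crawl_time: int, page: bytes) -> None:
        """Insert one snapshot, keeping the key list sorted."""
        key = (url, crawl_time)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._pages[i] = page  # same URL and time: overwrite
        else:
            self._keys.insert(i, key)
            self._pages.insert(i, page)

    def get(self, url: str, crawl_time: int) -> Optional[bytes]:
        """Point lookup: the page stored for exactly this URL and time."""
        key = (url, crawl_time)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._pages[i]
        return None

    def scan(self, url_lo: str, url_hi: str,
             t_lo: int, t_hi: int) -> List[Tuple[str, int, bytes]]:
        """Batch lookup over inclusive URL and time ranges."""
        out = []
        # First key that can match; earlier keys have a smaller URL,
        # or url_lo with a time before t_lo.
        i = bisect.bisect_left(self._keys, (url_lo, t_lo))
        while i < len(self._keys) and self._keys[i][0] <= url_hi:
            url, t = self._keys[i]
            if t_lo <= t <= t_hi:
                out.append((url, t, self._pages[i]))
            i += 1
        return out
```

In this sketch the URL range is compared as plain strings; a real archive would more plausibly normalise URLs (for example by reversing host names) so that pages of one site sort together, but that choice is an assumption and is not stated in the abstract.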

Keywords

Web Archiving; Web Storage System; Web InfoMall


Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Lian’en Huang (1)
  • Jinping Li (1)
  • Xiaoming Li (1)
  1. Shenzhen Key Laboratory for Cloud Computing Application and Technology, Peking University Shenzhen Graduate School, Shenzhen, China
