International Journal on Digital Libraries, Volume 6, Issue 4, pp 327–349

Using the web infrastructure to preserve web pages

  • Michael L. Nelson
  • Frank McCown
  • Joan A. Smith
  • Martin Klein
REGULAR PAPER

Abstract

To date, most of the focus regarding digital preservation has been on replicating copies of the resources to be preserved from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies to new media) and migrating (conversion to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However, due to the infrastructure costs (storage, networks, machines) and, more importantly, the human management costs, this approach is unsuitable for web-scale preservation. The result is that difficult decisions need to be made as to what is saved and what is not saved. We provide an overview of our ongoing research projects that focus on using the “web infrastructure” to provide preservation capabilities for web pages and examine the overlap these approaches have with the field of information retrieval. The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. These approaches are not intended to replace conventional archiving approaches; rather, they focus on providing at least some form of archival capability for the mass of web pages that may prove to have value in the future. We characterize the preservation approaches by the level of effort required by the web administrator: web sites are reconstructed from the caches of search engines (“lazy preservation”); lexical signatures are used to find the same or similar pages elsewhere on the web (“just-in-time preservation”); resources are pushed to other sites using NNTP newsgroups and SMTP email attachments (“shared infrastructure preservation”); and an Apache module is used to provide OAI-PMH access to MPEG-21 DIDL representations of web pages (“web server enhanced preservation”).
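
As an illustration of the "just-in-time preservation" idea mentioned above, the following is a minimal sketch of how a lexical signature might be computed. The abstract does not say how terms are weighted, so the TF-IDF weighting, the function names `tokenize` and `lexical_signature`, the signature length `k`, and the small in-memory `corpus` (standing in for document counts that would otherwise come from a search engine) are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF weight.

    `corpus` is a list of other documents, used only to estimate how
    common each term is (hypothetical stand-in for web-wide counts).
    """
    docs = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(page))
    n = len(docs) + 1  # the page itself counts as one document

    def weight(term):
        # Document frequency: the page plus every corpus doc with the term.
        df = 1 + sum(term in d for d in docs)
        return tf[term] * math.log(n / df)

    # Highest weight first; ties are broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-weight(t), t))
    return ranked[:k]


if __name__ == "__main__":
    page = "lazy preservation reconstructs web sites from search engine caches caches"
    corpus = ["web sites and search engine results", "the web is large", "search the web"]
    print(lexical_signature(page, corpus))
```

The resulting terms could then be submitted as a search-engine query to look for the same or a similar page at another location. Terms that appear in every document receive zero weight, so very common words tend to drop out of the signature.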

Keywords

Web infrastructure, Digital preservation, Web pages, OAI-PMH, Complex objects

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Michael L. Nelson (1)
  • Frank McCown (1)
  • Joan A. Smith (1)
  • Martin Klein (1)

  1. Old Dominion University, Norfolk, USA
