
Information Retrieval, Volume 8, Issue 1, pp 5–24

Index-Based Persistent Document Identifiers

  • Diomidis Spinellis

Abstract

The infrastructure of a typical search engine can be used to calculate and resolve persistent document identifiers: strings that uniquely identify and locate a document on the Internet without reference to its original location (URL). Bookmarking a document with such an identifier allows it to be retrieved even if its URL and, in many cases, its contents change. Web client applications can let users bookmark a page by reference to a search engine and the persistent identifier instead of the original URL. The identifiers are calculated using a global Internet term index; a document's unique identifier consists of a word or word combination that occurs only in that document. We use a genetic algorithm to locate a minimal unique document identifier: the shortest word or word combination that will locate the document. We tested our approach by implementing tools for indexing a document collection, calculating the persistent identifiers, performing queries, and distributing the computation and storage load among many computers.
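The idea of an identifier derived from a term index can be illustrated with a minimal sketch. It is not the paper's implementation: it swaps the genetic algorithm for an exhaustive search over word combinations, which is only workable on a toy collection, and it measures "shortest" by number of words. The function names (`build_index`, `resolve`, `minimal_identifier`), the `max_terms` limit, and the three sample documents are all hypothetical.

```python
from itertools import combinations


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def resolve(index, terms):
    """Return the ids of documents containing every term (an AND query)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def minimal_identifier(index, docs, doc_id, max_terms=3):
    """Return the smallest word combination matching only doc_id, or None.

    Exhaustive search stands in for the genetic algorithm named in the
    abstract; max_terms is an assumed cut-off, not a value from the paper.
    """
    terms = sorted(set(docs[doc_id].lower().split()))
    for size in range(1, max_terms + 1):
        for combo in combinations(terms, size):
            if resolve(index, combo) == {doc_id}:
                return combo
    return None


# Hypothetical three-document collection.
docs = {
    "a": "persistent identifiers for web documents",
    "b": "search engine index for web documents",
    "c": "persistent search engine index",
}
index = build_index(docs)

print(minimal_identifier(index, docs, "a"))  # ('identifiers',)
print(minimal_identifier(index, docs, "c"))  # ('engine', 'persistent')
```

Document "a" is identified by a single word because "identifiers" occurs nowhere else, while document "c" needs two words because each of its words also appears in another document. The number of candidate combinations grows combinatorially with vocabulary size, which is why a heuristic search is attractive at the scale of a real index.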

Keywords

URL, search engine, persistency, index, distributed approach



Copyright information

© Kluwer Academic Publishers 2005

Authors and Affiliations

  • Diomidis Spinellis, Department of Management Science and Technology, Athens University of Economics and Business, Greece
