International Journal on Digital Libraries

Volume 6, Issue 4, pp 313–326

Practical maintenance of evolving metadata for digital preservation: algorithmic solution and system support

  • Dongwon Lee


Metadata (i.e., data about data) of digital objects plays an important role in digital libraries and archives, and thus its quality needs to be well maintained. However, as digital objects evolve over time, their associated metadata evolves as well, causing a consistency issue. Since various functionalities of applications containing digital objects (e.g., digital libraries, public image repositories) are based on metadata, evolving metadata directly affects the quality of such applications. To make matters worse, modern data applications are often large-scale (having millions of digital objects) and are constructed by software agents or crawlers (thus often having automatically populated and erroneous metadata). In such an environment, it is challenging to quickly and accurately identify evolving metadata and fix it (if needed) while applications keep running. Despite the importance and implications of the problem, conventional solutions have been very limited. Most existing metadata-related approaches either focus on the model and semantics of metadata or simply keep an authority file of some sort for evolving metadata, and never fully exploit its potential usage from the system point of view. In contrast, the question that we raise in this paper is: “when millions of digital objects and their metadata are given, (1) how to quickly identify evolving metadata in various contexts, and (2) once the evolving metadata are identified, how to incorporate them into the system?” The significance of this paper is that we investigate a scalable algorithmic solution for identifying evolving metadata, emphasize the role of “systems” in maintenance, and argue that “systems” must proactively keep track of metadata changes and leverage the learned knowledge in their various services.


Keywords: Digital preservation, Evolving metadata





Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  1. The Pennsylvania State University, University Park, USA
