Some computational tools for digital archive and metadata maintenance


Computational tools are a mainstay of current search and recommendation technology. But modern digital archives are astonishingly diverse collections of older digitized material and newer “born digital” content, and finding interesting material in them is still challenging. The material often lacks the annotation, or metadata, that people need to find what interests them most. We describe four computational tools we developed to aid in the processing and maintenance of large digital archives. The first is an improvement to a graph layout algorithm for graphs with hundreds of thousands of nodes. The second is a new algorithm for matching databases with links among the objects, a task also known as the network alignment problem. The third is an optimization heuristic to disambiguate a set of geographic references in a book. The fourth is a technique to automatically generate a title from a description.
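
To make the network alignment task concrete, the following is a minimal sketch, not the algorithm developed in the article. It assumes one common way of scoring an alignment, the number of links preserved by a one-to-one mapping between the objects of two databases, and it finds the best mapping by brute force, which is only feasible for toy inputs. The function names `overlap` and `best_alignment` and the toy data are hypothetical.

```python
from itertools import permutations


def overlap(edges_a, edges_b, mapping):
    """Count edges of A whose mapped endpoints also form an edge of B."""
    undirected_b = {frozenset(edge) for edge in edges_b}
    return sum(
        1 for u, v in edges_a
        if frozenset((mapping[u], mapping[v])) in undirected_b
    )


def best_alignment(nodes_a, edges_a, nodes_b, edges_b):
    """Try every one-to-one mapping of A's nodes into B's nodes.

    Assumes len(nodes_a) <= len(nodes_b). The search is exponential,
    so this is an illustration of the objective, not a practical method.
    """
    best_map, best_score = None, -1
    for perm in permutations(nodes_b, len(nodes_a)):
        mapping = dict(zip(nodes_a, perm))
        score = overlap(edges_a, edges_b, mapping)
        if score > best_score:
            best_map, best_score = mapping, score
    return best_map, best_score


# Toy data (hypothetical): a path a-b-c matched against a path x-y-z
# in a second database that also holds an unlinked object w.
nodes_a = ["a", "b", "c"]
edges_a = [("a", "b"), ("b", "c")]
nodes_b = ["x", "y", "z", "w"]
edges_b = [("x", "y"), ("y", "z")]

mapping, score = best_alignment(nodes_a, edges_a, nodes_b, edges_b)
```

Here both links of the first database are preserved by the mapping found. Real archives have far too many objects for exhaustive search, which is why heuristic or approximate alignment algorithms are needed.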


Author information

Correspondence to Margot Gerritsen.

Additional information

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

The majority of David’s work was completed while at Stanford University.

Communicated by Axel Ruhe.

About this article

Cite this article

Gleich, D.F., Wang, Y., Meng, X. et al. Some computational tools for digital archive and metadata maintenance. Bit Numer Math 51, 127–154 (2011).

Keywords


  • Graph layout
  • Metadata remediation
  • Dynamic programming
  • Network alignment

Mathematics Subject Classification (2000)

  • 05C50
  • 05C85
  • 68T50
  • 90C39