Some computational tools for digital archive and metadata maintenance

BIT Numerical Mathematics

Abstract

Computational tools are a mainstay of current search and recommendation technology. But modern digital archives are astonishingly diverse collections of older digitized material and newer “born digital” content, and finding interesting material in them is still challenging. The material often lacks the annotation, or metadata, that would let people find what is most interesting. We describe four computational tools we developed to aid in the processing and maintenance of large digital archives. The first is an improvement to a graph layout algorithm for graphs with hundreds of thousands of nodes. The second is a new algorithm for matching databases with links among the objects, also known as a network alignment problem. The third is an optimization heuristic to disambiguate a set of geographic references in a book. The fourth is a technique to automatically generate a title from a description.

Author information

Corresponding author

Correspondence to Margot Gerritsen.

Additional information

Communicated by Axel Ruhe.

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

The majority of David’s work was completed while at Stanford University.

About this article

Cite this article

Gleich, D.F., Wang, Y., Meng, X. et al. Some computational tools for digital archive and metadata maintenance. Bit Numer Math 51, 127–154 (2011). https://doi.org/10.1007/s10543-011-0324-6

  • DOI: https://doi.org/10.1007/s10543-011-0324-6
