BIT Numerical Mathematics, Volume 51, Issue 1, pp 127–154

Some computational tools for digital archive and metadata maintenance

  • David F. Gleich
  • Ying Wang
  • Xiangrui Meng
  • Farnaz Ronaghi
  • Margot Gerritsen
  • Amin Saberi

Abstract

Computational tools are a mainstay of current search and recommendation technology. But modern digital archives are astonishingly diverse collections of older digitized material and newer “born digital” content, and finding interesting material in them is still challenging. The material often lacks the annotation, or metadata, that would help people find the most interesting items. We describe four computational tools we developed to aid in the processing and maintenance of large digital archives. The first is an improvement to a graph layout algorithm for graphs with hundreds of thousands of nodes. The second is a new algorithm for matching databases with links among the objects, also known as a network alignment problem. The third is an optimization heuristic to disambiguate a set of geographic references in a book. The fourth is a technique to automatically generate a title from a description.

Keywords

Graph layout, Metadata remediation, Dynamic programming, Network alignment

Mathematics Subject Classification (2000)

05C50, 05C85, 68T50, 90C39


Copyright information

© Springer Science + Business Media B.V. 2011

Authors and Affiliations

  • David F. Gleich (1)
  • Ying Wang (2)
  • Xiangrui Meng (2)
  • Farnaz Ronaghi (3)
  • Margot Gerritsen (2)
  • Amin Saberi (3)

  1. Sandia National Laboratories, Livermore, USA
  2. Institute for Computational and Mathematical Engineering, Stanford University, Stanford, USA
  3. Management Science and Engineering, Stanford University, Stanford, USA
