Some computational tools for digital archive and metadata maintenance

Gleich, David F.; Wang, Ying; Meng, Xiangrui; Ronaghi, Farnaz; Gerritsen, Margot; Saberi, Amin

doi:10.1007/s10543-011-0324-6


Published: 11 March 2011

Volume 51, pages 127–154 (2011)

BIT Numerical Mathematics



Abstract

Computational tools are a mainstay of current search and recommendation technology. Modern digital archives, however, are astonishingly diverse collections of older digitized material and newer “born digital” content, and finding interesting material in them is still challenging. The material often lacks the annotation, or metadata, that people need to find the most interesting items. We describe four computational tools we developed to aid in the processing and maintenance of large digital archives. The first is an improvement to a graph layout algorithm for graphs with hundreds of thousands of nodes. The second is a new algorithm for matching databases with links among the objects, also known as the network alignment problem. The third is an optimization heuristic to disambiguate a set of geographic references in a book. The fourth is a technique to automatically generate a title from a description.



Author information

Authors and Affiliations

Sandia National Laboratories, Livermore, CA, 94550, USA
David F. Gleich
Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, 94305, USA
Ying Wang, Xiangrui Meng & Margot Gerritsen
Management Science and Engineering, Stanford University, Stanford, CA, 94305, USA
Farnaz Ronaghi & Amin Saberi

Corresponding author

Correspondence to Margot Gerritsen.

Additional information

Communicated by Axel Ruhe.

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

The majority of David F. Gleich’s work was completed while he was at Stanford University.


About this article

Cite this article

Gleich, D.F., Wang, Y., Meng, X. et al. Some computational tools for digital archive and metadata maintenance. Bit Numer Math 51, 127–154 (2011). https://doi.org/10.1007/s10543-011-0324-6


Received: 13 September 2010
Accepted: 22 February 2011
Published: 11 March 2011
Issue Date: March 2011
DOI: https://doi.org/10.1007/s10543-011-0324-6

