Information Retrieval and Visualization for the Historical Domain

Berzak, Yevgeni; Richter, Michal; Ehrler, Carsten; Shore, Todd

doi:10.1007/978-3-642-20227-8_11

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Working with large and unstructured collections of historical documents is a challenging task for historians. Despite the recent growth in the volume of digitized historical data, available collections are rarely accompanied by computational tools that significantly facilitate this task. We address this shortage by proposing a visualization method for document collections that focuses on graphical representation of similarities between documents. The strength of each similarity is measured by the overlap of historically significant information, such as named entities, or by the overlap of general vocabulary. Similarity strengths are then encoded in the edges of a graph. The graph provides visual structure, revealing interpretable clusters and links between documents that are otherwise difficult to establish. We implement the idea of similarity graphs within an information retrieval system supported by an interactive graphical user interface. The system allows querying the database, visualizing the results, and browsing the collection in an effective and intuitive way. Our approach can be easily adapted and extended to collections of documents in other domains.
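The similarity-graph idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which overlap measure the system uses, so the Jaccard coefficient, the `threshold` parameter, the function names, and the toy documents below are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two sets in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(doc_entities, threshold=0.2):
    """Build a weighted edge map {(doc_i, doc_j): weight}.

    doc_entities maps a document id to the set of named entities found in it.
    An edge is kept only if the overlap reaches the (hypothetical) threshold.
    """
    edges = {}
    for id_a, id_b in combinations(sorted(doc_entities), 2):
        weight = jaccard(doc_entities[id_a], doc_entities[id_b])
        if weight >= threshold:
            edges[(id_a, id_b)] = weight
    return edges


# Toy, hypothetical data: three documents and their named entities.
docs = {
    "doc_a": {"Paris", "Napoleon", "Waterloo"},
    "doc_b": {"Paris", "Napoleon", "Vienna"},
    "doc_c": {"Lisbon"},
}
graph = similarity_graph(docs)
print(graph)
```

In this sketch, `doc_a` and `doc_b` share two of four distinct entities and are linked, while `doc_c` shares nothing and stays isolated. Replacing the entity sets with sets of vocabulary terms would give the general-vocabulary variant mentioned in the abstract.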

Acknowledgements

We would like to thank the following people at Saarland University:

∙ Caroline Sporleder, for her dedicated guidance and valuable advice on the project.

∙ Martin Schreiber, for his feedback on the system from the user perspective.

Michal Richter has been supported by grant ME838 of the Czech Republic Ministry of Education, Youth and Sport.

Author information

Authors and Affiliations

Saarland University, 66041, Saarbrücken, Germany
Yevgeni Berzak, Michal Richter, Carsten Ehrler & Todd Shore

Corresponding author

Correspondence to Yevgeni Berzak.

Editor information

Editors and Affiliations

Computational Linguistics / MMCI, Saarland University, Saarbrücken, 66041, Germany
Caroline Sporleder
Fac. Humanities, Tilburg University, Tilburg, Netherlands
Antal van den Bosch
Tilburg School for Humanities, Tilburg Center for Cognition and Communi, University of Tilburg, Tilburg, 5000, Netherlands
Kalliopi Zervanou

About this paper

Cite this paper

Berzak, Y., Richter, M., Ehrler, C., Shore, T. (2011). Information Retrieval and Visualization for the Historical Domain. In: Sporleder, C., van den Bosch, A., Zervanou, K. (eds) Language Technology for Cultural Heritage. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20227-8_11

DOI: https://doi.org/10.1007/978-3-642-20227-8_11
Published: 26 April 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20226-1
Online ISBN: 978-3-642-20227-8
eBook Packages: Computer Science, Computer Science (R0)
