Information Retrieval and Visualization for the Historical Domain

  • Conference paper
Language Technology for Cultural Heritage

Abstract

Working with large, unstructured collections of historical documents is a challenging task for historians. Despite the recent growth in the volume of digitized historical data, available collections are rarely accompanied by computational tools that significantly facilitate this task. We address this shortage by proposing a visualization method for document collections that focuses on the graphical representation of similarities between documents. The strength of a similarity is measured by the overlap of historically significant information, such as named entities, or by the overlap of general vocabulary. Similarity strengths are then encoded in the edges of a graph. The graph provides visual structure, revealing interpretable clusters and links between documents that are otherwise difficult to establish. We implement the idea of similarity graphs within an information retrieval system supported by an interactive graphical user interface. The system allows users to query the database, visualize the results, and browse the collection in an effective and intuitive way. Our approach can easily be adapted and extended to collections of documents in other domains.
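As a rough illustration of the similarity graph idea, the sketch below weights an edge between two documents by the overlap of their named entities. The abstract does not name the overlap measure, so the Jaccard coefficient used here is an assumption, as are the function names `jaccard` and `similarity_graph` and the toy document data; this is not the authors' implementation.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(entities_by_doc: dict, threshold: float = 0.0) -> dict:
    """Return {(doc_a, doc_b): weight} for each pair whose overlap exceeds threshold."""
    edges = {}
    for a, b in combinations(sorted(entities_by_doc), 2):
        weight = jaccard(entities_by_doc[a], entities_by_doc[b])
        if weight > threshold:
            edges[(a, b)] = weight
    return edges


# Hypothetical toy data: document id -> named entities found in that document.
docs = {
    "speech_1": {"Havana", "Cuba", "United States"},
    "speech_2": {"Havana", "Cuba", "Moscow"},
    "speech_3": {"Moscow", "Soviet Union"},
}

graph = similarity_graph(docs)
for (a, b), w in sorted(graph.items()):
    print(f"{a} -- {b}: {w:.2f}")
```

Pairs with no shared entities get no edge, so raising `threshold` thins the graph and leaves only the stronger links visible. Swapping the entity sets for sets of general vocabulary terms would give the second kind of similarity the abstract mentions.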

Acknowledgements

We would like to thank the following people at Saarland University:

 ∙ Caroline Sporleder, for her dedicated guidance and valuable advice on the project.

 ∙ Martin Schreiber, for his feedback on the system from the user perspective.

Michal Richter has been supported by grant ME838 of the Czech Republic Ministry of Education, Youth and Sport.

Author information

Corresponding author

Correspondence to Yevgeni Berzak.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Berzak, Y., Richter, M., Ehrler, C., Shore, T. (2011). Information Retrieval and Visualization for the Historical Domain. In: Sporleder, C., van den Bosch, A., Zervanou, K. (eds) Language Technology for Cultural Heritage. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20227-8_11

  • DOI: https://doi.org/10.1007/978-3-642-20227-8_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20226-1

  • Online ISBN: 978-3-642-20227-8

  • eBook Packages: Computer Science, Computer Science (R0)