Abstract
Working with large, unstructured collections of historical documents is a challenging task for historians. Despite the recent growth in the volume of digitized historical data, available collections are rarely accompanied by computational tools that significantly facilitate this task. We address this shortage by proposing a visualization method for document collections that focuses on the graphical representation of similarities between documents. The strength of a similarity is measured by the overlap of historically significant information, such as named entities, or by the overlap of general vocabulary. Similarity strengths are then encoded in the edges of a graph. The graph provides visual structure, revealing interpretable clusters and links between documents that are otherwise difficult to establish. We implement the idea of similarity graphs within an information retrieval system supported by an interactive graphical user interface. The system allows users to query the database, visualize the results and browse the collection in an effective and intuitive way. Our approach can easily be adapted and extended to collections of documents in other domains.
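The idea of encoding overlap strength in graph edges can be illustrated with a minimal sketch. It assumes that each document has already been reduced to a set of features (named entities or vocabulary terms) and uses the Jaccard coefficient as the overlap measure; the abstract does not state which measure the system uses, so this choice, the function names, the `threshold` parameter and the toy input are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two feature sets, scaled to [0, 1]; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(docs, threshold=0.2):
    """Return weighted edges {(id_a, id_b): weight} for sufficiently similar pairs.

    docs maps a document id to its set of features (named entities or
    vocabulary terms). threshold is an assumed pruning parameter: pairs
    whose overlap falls below it get no edge.
    """
    edges = {}
    # Sort by document id so that edge keys are deterministic.
    for (id_a, feats_a), (id_b, feats_b) in combinations(sorted(docs.items()), 2):
        weight = jaccard(feats_a, feats_b)
        if weight >= threshold:
            edges[(id_a, id_b)] = weight
    return edges


# Toy input, invented purely for illustration.
docs = {
    "doc_a": {"Havana", "Moscow", "Washington"},
    "doc_b": {"Havana", "Moscow", "Prague"},
    "doc_c": {"sugar", "harvest"},
}
edges = similarity_graph(docs)
```

In this toy input, only `doc_a` and `doc_b` share features, so the graph contains a single weighted edge and `doc_c` remains an isolated node. Swapping named-entity sets for general vocabulary sets changes only the contents of `docs`, not the construction.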
Acknowledgements
We would like to thank the following people at Saarland University:
∙ Caroline Sporleder, for her dedicated guidance and valuable advice on the project.
∙ Martin Schreiber, for his feedback on the system from the user perspective.
Michal Richter has been supported by grant ME838 of the Czech Republic Ministry of Education, Youth and Sport.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Berzak, Y., Richter, M., Ehrler, C., Shore, T. (2011). Information Retrieval and Visualization for the Historical Domain. In: Sporleder, C., van den Bosch, A., Zervanou, K. (eds) Language Technology for Cultural Heritage. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20227-8_11
DOI: https://doi.org/10.1007/978-3-642-20227-8_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20226-1
Online ISBN: 978-3-642-20227-8
eBook Packages: Computer Science, Computer Science (R0)