An Image Based Approach for Content Analysis in Document Collections

  • Reinhold Huber-Mörk
  • Alexander Schindler
Conference paper

DOI: 10.1007/978-3-642-41939-3_27

Part of the Lecture Notes in Computer Science book series (LNCS, volume 8034)
Cite this paper as:
Huber-Mörk R., Schindler A. (2013) An Image Based Approach for Content Analysis in Document Collections. In: Bebis G. et al. (eds) Advances in Visual Computing. ISVC 2013. Lecture Notes in Computer Science, vol 8034. Springer, Berlin, Heidelberg

Abstract

We consider the task of content-based analysis and categorization in large-scale historical book scanning projects. Mixed content, deprecated language, noise and unexpected distortions suggest an image-based approach. Keypoint extractors combined with the bag-of-features approach are applied to scanned text documents. To incorporate spatial information into the bag-of-features approach, we consider three methods of spatial verification. An approach based on comparing statistics of local keypoint properties such as size, orientation and scale showed comparable quality in content comparison while being computationally much more efficient. Cluster analysis delivers groups of pages characterized by common properties; in particular, duplicated page content is detected with high reliability.
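
The two page representations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed codebook, the bin counts, the value ranges and the histogram-intersection similarity are all assumptions chosen for illustration, and keypoint extraction itself is assumed to have happened already, so descriptors, orientations and scales arrive as plain NumPy arrays.

```python
import numpy as np


def bof_histogram(descriptors, codebook):
    """Bag-of-features histogram for one page.

    descriptors: (n, d) array of local keypoint descriptors.
    codebook:    (k, d) array of visual words (assumed to be given,
                 e.g. from a prior clustering step).
    Returns an L1-normalized histogram of length k.
    """
    # Squared Euclidean distance of every descriptor to every visual word.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    # Assign each descriptor to its nearest visual word.
    words = dist2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist


def keypoint_property_signature(orientations, scales, n_bins=8):
    """Cheaper page signature built only from keypoint properties.

    orientations: angles in degrees, assumed to lie in [0, 360).
    scales:       positive keypoint scales, assumed to lie in [1, 256].
    Values outside the assumed ranges are ignored by np.histogram.
    """
    o_hist, _ = np.histogram(orientations, bins=n_bins, range=(0.0, 360.0))
    s_hist, _ = np.histogram(np.log2(scales), bins=n_bins, range=(0.0, 8.0))
    sig = np.concatenate([o_hist, s_hist]).astype(float)
    total = sig.sum()
    return sig / total if total > 0 else sig


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two L1-normalized histograms."""
    return float(np.minimum(h1, h2).sum())


if __name__ == "__main__":
    # Toy data: two visual words and two tiny "pages".
    codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
    page_a = np.array([[0.1, 0.0], [9.9, 10.0], [10.0, 10.2], [0.0, 0.3]])
    page_b = np.array([[0.0, 0.0], [0.1, 0.1]])
    h_a = bof_histogram(page_a, codebook)
    h_b = bof_histogram(page_b, codebook)
    print("BoF similarity:", histogram_intersection(h_a, h_b))
```

Under these assumptions, two scans of the same page would yield nearly identical histograms and a similarity close to 1, which is the kind of signal a duplicate-detection or clustering step could use. The property signature avoids descriptor quantization entirely, which is one plausible reason such a representation is cheaper to compute.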

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Reinhold Huber-Mörk (1)
  • Alexander Schindler (1)

  1. Intelligent Vision Systems, Safety & Security Department, AIT Austrian Institute of Technology GmbH, Vienna, Austria
