An Image Based Approach for Content Analysis in Document Collections

  • Reinhold Huber-Mörk
  • Alexander Schindler
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8034)


We consider the task of content-based analysis and categorization in large-scale historical book scanning projects. Mixed content, outdated language, noise and unexpected distortions suggest an image-based approach. Keypoint extractors combined with the bag-of-features approach are applied to scanned text documents. To incorporate spatial information into the bag-of-features approach, we consider three methods of spatial verification. An approach based on comparing statistics of local keypoint properties, such as size, orientation and scale, showed comparable quality in content comparison while being computationally much more efficient. Cluster analysis delivers groups of pages characterized by common properties; in particular, duplicated page content is detected with high reliability.
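The bag-of-features comparison mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline, which the abstract does not spell out. It assumes keypoint descriptors have already been extracted, uses toy 2-D vectors in place of real descriptors such as SIFT, and takes a hypothetical three-word visual vocabulary as given. The function names `nearest_word`, `bag_of_features` and `cosine_similarity` are illustrative only.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to a descriptor."""
    return min(
        range(len(vocabulary)),
        key=lambda i: math.dist(descriptor, vocabulary[i]),
    )


def bag_of_features(descriptors, vocabulary):
    """Build a normalised term-frequency histogram over the vocabulary."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


def cosine_similarity(a, b):
    """Cosine similarity of two histograms; 0.0 if either one is empty."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vocabulary (cluster centres) and toy page descriptors.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
page_a = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.9)]
page_b = [(0.0, 0.1), (1.0, 0.0), (0.9, 0.0), (0.1, 1.1)]  # near duplicate of page_a
page_c = [(0.0, 1.0), (0.1, 0.9), (-0.1, 1.1), (0.0, 0.8)]  # different content

hist_a = bag_of_features(page_a, vocabulary)
hist_b = bag_of_features(page_b, vocabulary)
hist_c = bag_of_features(page_c, vocabulary)

print(cosine_similarity(hist_a, hist_b))  # near-duplicate pair scores higher
print(cosine_similarity(hist_a, hist_c))  # unrelated pair scores lower
```

A plain histogram of this kind discards where the keypoints lie on the page, which is why the abstract goes on to consider spatial verification as a separate step.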


Visual Word; Document Image; Term Frequency; Scale Invariant Feature Transform; Historical Book
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Reinhold Huber-Mörk (1)
  • Alexander Schindler (1)
  1. Intelligent Vision Systems, Safety & Security Department, AIT Austrian Institute of Technology GmbH, Vienna, Austria
