GFG-Based Compression and Retrieval of Document Images in Indian Scripts

  • Gaurav Harit
  • Santanu Chaudhury
  • Ritu Garg
Part of the Advances in Pattern Recognition book series (ACVPR)


Indexing and retrieval of Indian language documents is an important problem. We present an interactive access scheme for Indian language document collections that uses word-image-based search. The compression and retrieval paradigm we propose is applicable even to those Indian scripts for which reliable OCR technology is not available. Our word spotting technique exploits the geometrical features of the word image. These features are represented as a graph called the geometric feature graph (GFG). The GFG is encoded as a string, which serves as a compressed representation of the word image skeleton. We have also augmented GFG-based word image spotting with latent semantic analysis for more effective retrieval. The query is specified as a set of word images, and the documents that best match the query representation in the latent semantic space are retrieved. The retrieval paradigm is further raised to the conceptual level by using knowledge of the document image content domain, specified in the form of an ontology.
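
Because the GFG is encoded as a string, word spotting can be pictured as approximate string matching between the query's code and the codes of indexed word images. The following is a minimal sketch under stated assumptions, not the chapter's actual matcher: the symbol alphabet (`L`, `C`, `J`, `E`), the function names `edit_distance` and `spot_words`, the length normalisation, and the 0.25 threshold are all hypothetical, and plain Levenshtein distance stands in for whatever matching cost the authors use.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a two-row DP table."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def spot_words(query_code: str,
               index: List[Tuple[str, str]],
               max_norm_dist: float = 0.25) -> List[Tuple[str, float]]:
    """Return (word_id, normalised distance) for indexed codes close to the query.

    `index` holds (word_id, gfg_code) pairs; the codes are hypothetical
    stand-ins for GFG strings. Results are sorted from best to worst match.
    """
    hits = []
    for word_id, code in index:
        dist = edit_distance(query_code, code)
        norm = dist / max(len(query_code), len(code), 1)  # scale to [0, 1]
        if norm <= max_norm_dist:
            hits.append((word_id, norm))
    return sorted(hits, key=lambda hit: hit[1])


# Toy index: three word images with made-up GFG codes (assumed alphabet).
toy_index = [("w1", "LCJLE"), ("w2", "LCJE"), ("w3", "EEJJCC")]
matches = spot_words("LCJLE", toy_index)
print(matches)
```

In this toy run the exact copy and the code with one missing symbol fall within the threshold, while the unrelated code is rejected. Normalising by the longer string keeps the threshold comparable across short and long words; that choice is an illustration, not something the abstract specifies.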


Geometric feature graph (GFG); Word spotting; Latent semantic analysis; Indic scripts



Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  1. IIT Delhi, New Delhi, India
