Group 4 Compressed Document Matching

  • Dar-Shyang Lee
  • Jonathan J. Hull
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1655)

Abstract

Numerous approaches to detecting duplicate documents have been investigated, including textual, structural and featural ones. Because document images are usually stored and transmitted in compressed form, it is advantageous to perform document matching directly on the compressed data. A two-stage process for matching Group 4 compressed document images is presented. In the coarse matching stage, ranked hypotheses are generated based on correlations of compression bit profiles. These candidates are then evaluated further using a feature set similar to the pass codes. Multiple descriptors based on the local arrangement of the feature points are constructed for efficient indexing into the database. The performance of the algorithm on the UW database is discussed.
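The coarse matching stage can be illustrated with a small sketch. It is not the authors' implementation: it assumes that a "bit profile" is a sequence giving the number of compressed bits spent on each block of scan lines, and it uses plain Pearson correlation as the similarity score. The names `pearson` and `rank_candidates`, and the toy profiles in the usage note, are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two profiles, compared over their common length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A flat profile carries no shape information to correlate.
        return 0.0
    return cov / sqrt(var_a * var_b)


def rank_candidates(query_profile, database, top_k=3):
    """Return the top_k (doc_id, score) pairs, best correlation first.

    `database` maps a document id to its stored bit profile (assumed format).
    """
    scored = [(pearson(query_profile, profile), doc_id)
              for doc_id, profile in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]
```

With toy profiles such as `{"a": [10, 50, 12, 48, 11], "b": [30, 30, 31, 29, 30], "c": [50, 10, 48, 12, 49]}` and the query `[11, 52, 13, 47, 10]`, document `a` is ranked first. In a two-stage scheme of the kind the abstract describes, only these top-ranked hypotheses would be passed on to the more expensive feature-based comparison.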

Keywords

Feature Point, Hausdorff Distance, Document Image, Text Line, Pass Code
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Dar-Shyang Lee (1)
  • Jonathan J. Hull (1)
  1. Ricoh California Research Center, Menlo Park
