Exploiting WWW Resources in Experimental Document Analysis Research

  • Daniel Lopresti
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2423)

Abstract

Many large collections of document images are now becoming available online as part of digital library initiatives, fueled by the explosive growth of the World Wide Web. In this paper, we examine protocols and system-related issues that arise in attempting to make use of these new resources, both as a target application (building better search engines) and as a way of overcoming the problem of acquiring ground-truth to support experimental document analysis research. We also report on our experiences running two simple tests involving data drawn from one such collection. The potential synergies between document analysis and digital libraries could lead to substantial benefits for both communities.

Keywords

Digital Library; Document Analysis; Document Image; Optical Character Recognition; Target Page

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Daniel Lopresti, Bell Labs, Lucent Technologies Inc., Murray Hill, USA