Exploiting WWW Resources in Experimental Document Analysis Research
Many large collections of document images are now becoming available online as part of digital library initiatives, fueled by the explosive growth of the World Wide Web. In this paper, we examine protocols and system-related issues that arise in attempting to make use of these new resources, both as a target application (building better search engines) and as a way of overcoming the problem of acquiring ground-truth to support experimental document analysis research. We also report on our experiences running two simple tests involving data drawn from one such collection. The potential synergies between document analysis and digital libraries could lead to substantial benefits for both communities.
Keywords: Digital Library, Document Analysis, Document Image, Optical Character Recognition, Target Page