Statistical Analysis of Bibliographic Strings for Constructing an Integrated Document Space
It is important to utilize retrospective documents when constructing a large digital library. This paper proposes a method for analyzing bibliographic strings recognized by Optical Character Recognition (OCR), using an extended hidden Markov model. The method can analyze erroneous bibliographic strings and integrate many documents accumulated as printed articles into a citation index. It provides a robust bibliographic matching function by combining a statistical description of the syntax of bibliographic strings, a language model, and an OCR error model. It also reduces the cost of preparing training data for parameter estimation by using records in the bibliographic database.
Keywords: Digital Library, Document Image, Query Term, Optical Character Recognition, Bibliographic Database