Statistical Analysis of Bibliographic Strings for Constructing an Integrated Document Space

  • Atsuhiro Takasu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2458)


Making use of retrospective documents is important when constructing a large digital library. This paper proposes a method for analyzing bibliographic strings produced by Optical Character Recognition (OCR), using an extended hidden Markov model. The method can analyze bibliographic strings that contain recognition errors, and it integrates the many documents that have accumulated as printed articles into a citation index. It has two advantages. First, it provides a robust bibliographic matching function by combining a statistical description of the syntax of bibliographic strings, a language model, and an OCR error model. Second, it reduces the cost of preparing training data for parameter estimation by using records that already exist in the bibliographic database.
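The idea of matching an erroneous recognized string against clean database records can be illustrated with a much simpler stand-in than the extended hidden Markov model described here. The sketch below uses a weighted edit distance in which substitutions between visually confusable characters are cheap. The confusion pairs, the 0.2 cost, the function names, and the sample record titles are all assumptions made for illustration; none of them come from the paper.

```python
# Toy OCR-aware matching: a weighted edit distance, NOT the paper's extended HMM.
# Hypothetical confusion pairs and costs, chosen only for illustration.
CONFUSABLE = {
    frozenset(pair)
    for pair in [("l", "1"), ("O", "0"), ("I", "l"), ("5", "S"), ("e", "c")]
}


def sub_cost(a, b):
    """Cost of reading reference character b as observed character a."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return 0.2  # assumed cost for a typical OCR confusion
    return 1.0


def ocr_distance(observed, reference):
    """Weighted edit distance between an OCR string and a clean reference."""
    prev = [float(j) for j in range(len(reference) + 1)]
    for i, a in enumerate(observed, start=1):
        curr = [float(i)]
        for j, b in enumerate(reference, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # drop a spurious observed character
                curr[j - 1] + 1.0,             # restore a character OCR missed
                prev[j - 1] + sub_cost(a, b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def best_match(observed, records):
    """Return the database record closest to the observed OCR string."""
    return min(records, key=lambda r: ocr_distance(observed, r))


# Illustrative stand-ins for titles stored in a bibliographic database.
records = [
    "Document Analysis System",
    "Digital libraries and autonomous citation indexing",
    "An investigation of different string coding methods",
]
noisy = "Docurnent Ana1ysis 5ystem"  # 'm' read as 'rn', 'l' as '1', 'S' as '5'
print(best_match(noisy, records))
```

A fixed cost table like this treats every character independently and knows nothing about field syntax. The abstract's point is that a statistical model can instead combine the syntax of bibliographic strings, a language model, and an OCR error model, with parameters estimated from existing database records.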


Digital Library, Document Image, Query Term, Optical Character Recognition, Bibliographic Database
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Atsuhiro Takasu
    • 1
  1. National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
