Text Verification in an Automated System for the Extraction of Bibliographic Data

  • George R. Thoma
  • Glenn Ford
  • Daniel Le
  • Zhirong Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2423)


An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper describes two approaches and gives preliminary performance data.


Keywords: Document Image; Bibliographic Data; Journal Issue; Bibliographic Record; Affiliation Field
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • George R. Thoma (1)
  • Glenn Ford (1)
  • Daniel Le (1)
  • Zhirong Li (1)

  1. National Library of Medicine, Bethesda, Maryland
