Design of a Digital Library for Early 20th Century Medico-legal Documents

  • George R. Thoma
  • Song Mao
  • Dharitri Misra
  • John Rees
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4172)

Abstract

A digital library being designed at the U.S. National Library of Medicine aims to enhance the research value of an important collection of government documents for historians of medicine and law. This paper presents work toward the design of a system for preserving and providing access to this material, focusing mainly on the automated extraction of the descriptive metadata needed for future access. Because manual entry of these metadata for thousands of documents is unaffordable, automation is required. Successful metadata extraction relies on accurate classification of key textlines in the document. Methods are described for selecting scanning alternatives that lead to high OCR conversion performance, and for combining a Support Vector Machine (SVM) with a Hidden Markov Model (HMM) to classify textlines and extract metadata. Experimental results from our initial research toward an optimal textline classifier and metadata extractor are given.
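The abstract does not say how the SVM and HMM are combined, so the following is only a minimal sketch of one common pattern: per-textline classifier scores are treated as emission scores, and Viterbi decoding over a label-transition model picks the most likely label sequence. The labels (`title`, `body`), all probabilities, and the function name `viterbi` are hypothetical illustrations and are not taken from the paper.

```python
import math


def viterbi(emission_probs, transition, initial, labels):
    """Return the most likely label sequence for a list of textlines.

    emission_probs: one dict per textline mapping label -> score in (0, 1];
                    stands in for per-line SVM outputs (an assumption).
    transition:     dict mapping (prev_label, label) -> probability.
    initial:        dict mapping label -> probability of starting there.
    """
    if not emission_probs:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][lab] = best log score of any path ending in `lab` at line t
    best = [{lab: log(initial[lab]) + log(emission_probs[0][lab]) for lab in labels}]
    back = []  # back[t-1][lab] = best predecessor of `lab` at line t
    for t in range(1, len(emission_probs)):
        scores, pointers = {}, {}
        for lab in labels:
            prev = max(labels, key=lambda p: best[-1][p] + log(transition[(p, lab)]))
            scores[lab] = (best[-1][prev] + log(transition[(prev, lab)])
                           + log(emission_probs[t][lab]))
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical two-label example: the second line's score slightly favors
# "title", but the transition model makes "title" -> "title" unlikely.
labels = ["title", "body"]
initial = {"title": 0.9, "body": 0.1}
transition = {
    ("title", "title"): 0.1, ("title", "body"): 0.9,
    ("body", "title"): 0.1, ("body", "body"): 0.9,
}
line_scores = [
    {"title": 0.8, "body": 0.2},
    {"title": 0.55, "body": 0.45},
    {"title": 0.1, "body": 0.9},
]
decoded = viterbi(line_scores, transition, initial, labels)
print(decoded)  # ['title', 'body', 'body']
```

In this toy input, labeling each line independently by its highest score would give `title, title, body`; the sequence model changes the second line to `body`, which illustrates why a classifier is often paired with a sequence model for textline labeling.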

Keywords

Support Vector Machine; Hidden Markov Model; Digital Library; Case Body; Toner Level
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • George R. Thoma (1)
  • Song Mao (1)
  • Dharitri Misra (1)
  • John Rees (1)
  1. U.S. National Library of Medicine, Bethesda, USA
