Towards Versatile Document Analysis Systems

  • Henry S. Baird
  • Matthew R. Casey
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)

Abstract

The research goal of highly versatile document analysis systems, capable of performing useful functions on the great majority of document images, seems to be receding, even in the face of decades of research. One family of nearly universally applicable capabilities includes document image content extraction tools able to locate regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc. Solving this problem in its full generality requires coping with a vast diversity of document and image types. The severity of the methodological problems is suggested by the lack of agreement within the R&D community on even what is meant by a representative set of samples in this context. Even when this is agreed upon, it is often not clear how sufficiently large sets for training and testing can be collected and ground-truthed. Perhaps this can be alleviated by discovering a principled way to amplify sample sets using synthetic variations. We will then need classification methodologies capable of learning automatically from these huge sample sets in spite of their poorly parameterized (or unparameterizable) distributions. Perhaps fast expected-time approximate k-nearest neighbors classifiers are a good solution, even if they tend to require enormous data structures: hashed k-d trees seem promising. We discuss these issues and report recent progress towards their resolution.

Keywords: versatile document analysis systems, DAS methodology, document image content extraction, classification, k-nearest neighbors, k-d trees, CART, spatial data structures, computational geometry, hashing
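
To make the classifier idea in the abstract concrete, here is a minimal sketch of k-nearest-neighbors classification over a k-d tree. It is an illustration under stated assumptions, not the authors' method: the tree is exact and unhashed (the abstract refers to approximate, hashed k-d trees), and the two-dimensional feature vectors, the sample values, and the function names `build_kdtree`, `knn_search`, and `classify` are all hypothetical.

```python
import heapq
from collections import Counter


def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (vector, label) pairs.

    Splits on the median along one axis per level, cycling through axes.
    """
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_search(node, query, k, heap):
    """Collect the k nearest neighbors of `query` into `heap`.

    `heap` holds (-squared_distance, vector, label) tuples, so heap[0]
    is always the farthest of the current best k candidates.
    """
    if node is None:
        return
    vec, label = node["point"]
    d = sq_dist(vec, query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, vec, label))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, vec, label))

    axis = node["axis"]
    diff = query[axis] - vec[axis]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]

    knn_search(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th best neighbor (or fewer than k have been found so far).
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn_search(far, query, k, heap)


def classify(tree, query, k=3):
    """Label `query` by majority vote among its k nearest neighbors."""
    heap = []
    knn_search(tree, query, k, heap)
    votes = Counter(label for _, _, label in heap)
    return votes.most_common(1)[0][0]


# Hypothetical training samples: (feature vector, content type).
samples = [
    ((0.10, 0.20), "machine-print"),
    ((0.20, 0.10), "machine-print"),
    ((0.15, 0.25), "machine-print"),
    ((0.80, 0.90), "photograph"),
    ((0.90, 0.80), "photograph"),
    ((0.85, 0.75), "photograph"),
]
tree = build_kdtree(samples)
print(classify(tree, (0.2, 0.2), k=3))  # prints machine-print
```

The pruning test in `knn_search` is what gives k-d trees their fast expected query time on well-distributed data. An approximate variant would relax that test, trading accuracy for speed.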


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Henry S. Baird (1)
  • Matthew R. Casey (1)
  1. Computer Science & Engineering Dept., Lehigh University, Bethlehem, USA
