Biblio: automatic meta-data extraction

  • Carl Staelin
  • Michael Elad
  • Darryl Greig
  • Oded Shmueli
  • Marie Vans
Original Paper

Abstract

Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We provide results from experiments on the recognition of document information in two document corpuses: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor quality scans of FAX-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to hand-coded systems, and much worse performance on the legal documents.

Keywords

  • Document recognition
  • Document understanding
  • Neural networks
  • Support vector machines
  • Machine learning

Copyright information

© Springer-Verlag 2006

Authors and Affiliations

  • Carl Staelin (1)
  • Michael Elad (3)
  • Darryl Greig (2)
  • Oded Shmueli (3)
  • Marie Vans (1)

  1. Hewlett-Packard Laboratories, Technion City, Haifa, Israel
  2. Hewlett-Packard Laboratories, Bristol, UK
  3. The Computer Science Department, Israel Institute of Technology, Haifa, Israel
