An Integrated Approach for Automatic Semantic Structure Extraction in Document Images

  • Margherita Berardi
  • Michele Lapi
  • Donato Malerba
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3163)


In this paper we present an integrated approach to semantic structure extraction from document images. Document images are first processed to extract both their layout and logical structures on the basis of geometrical and spatial information. The textual content of the logical components is then used to assign semantic labels to layout structures automatically. Different machine learning techniques are applied to support the whole process. Experimental results on a set of multi-page biomedical documents are discussed, and future directions are outlined.


Document Image, Omission Error, Layout Structure, Logical Component, Training Document
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Margherita Berardi (1)
  • Michele Lapi (1)
  • Donato Malerba (1)
  1. Dipartimento di Informatica, Università degli Studi di Bari, Bari
