Automatic Selection of Table Areas in Documents for Information Extraction

  • Ana Costa e Silva
  • Alípio Jorge
  • Luís Torgo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2902)

Abstract

The information contained in companies’ financial statements is valuable to several users. Much of the relevant information in such documents is contained in tables and is currently mainly extracted by hand. We propose a method that accomplishes a prior step of the task of automatically extracting information from tables in documents: selecting the lines that are likely to belong to tables. Our method has been developed by empirically analyzing a set of Portuguese companies’ financial statements using statistical and data mining techniques. Empirical evaluation indicates that more than 99% of table lines are selected after discarding at least 50% of all lines. The method can cope with the complexity of styles used in assembling information on paper and adapt its performance accordingly, thus maximizing its results.
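The two figures quoted in the evaluation (share of table lines retained, share of all lines discarded) can be made concrete with a small sketch. This is an illustration of the arithmetic only, not the authors' selection method. The function name selection_metrics and the boolean-list inputs are assumptions introduced here: one flag per document line saying whether it truly belongs to a table, and one saying whether a selector kept it.

```python
from typing import Sequence, Tuple


def selection_metrics(
    is_table_line: Sequence[bool],
    is_selected: Sequence[bool],
) -> Tuple[float, float]:
    """Return (table-line recall, discard rate) for a line selector.

    Hypothetical helper: both inputs hold one flag per document line.
    """
    if len(is_table_line) != len(is_selected):
        raise ValueError("both sequences must describe the same lines")

    total_lines = len(is_table_line)
    table_lines = sum(1 for t in is_table_line if t)
    # Table lines that survived the selection step.
    kept_table_lines = sum(
        1 for t, s in zip(is_table_line, is_selected) if t and s
    )
    # Lines of any kind that the selector threw away.
    discarded_lines = sum(1 for s in is_selected if not s)

    recall = kept_table_lines / table_lines if table_lines else float("nan")
    discard_rate = discarded_lines / total_lines if total_lines else float("nan")
    return recall, discard_rate


if __name__ == "__main__":
    # Toy document of six lines, three of which are table lines.
    truth = [True, True, False, False, False, True]
    kept = [True, True, False, True, False, True]
    recall, discard_rate = selection_metrics(truth, kept)
    print(f"table-line recall: {recall:.2%}, lines discarded: {discard_rate:.2%}")
```

Under this reading, the abstract's claim corresponds to a recall above 0.99 at a discard rate of at least 0.50, measured over the evaluated financial statements.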

Keywords

Information Extraction; Data Mining Technique; Plain Text; Source Document; Automatic Selection
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Ana Costa e Silva (1)
  • Alípio Jorge (2)
  • Luís Torgo (2)
  1. Banco de Portugal, Portugal
  2. Faculdade de Economia do Porto, LIACC, Universidade do Porto, Portugal
