Table Detection from Plain Text Using Machine Learning and Document Structure

  • Juanzi Li
  • Jie Tang
  • Qiang Song
  • Peng Xu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3841)

Abstract

This paper addresses the problem of table extraction from plain text. Tables are one of the most common ways of presenting information, and table extraction has applications in information retrieval, knowledge acquisition, and text mining. Automatic information extraction from tables remains a challenge. Existing methods have focused mainly on table extraction from web pages (formatted table extraction). To the best of our knowledge, table extraction from plain text has not yet received sufficient attention. In this paper, unformatted table extraction is formalized as two subtasks: unformatted table block detection and unformatted table row identification. We concentrate particularly on table extraction from Chinese documents. We propose to perform table extraction by combining machine learning methods with document structure. We first view the task as a classification problem and propose a statistical approach based on Naïve Bayes, defining the features used in the classification model. Next, we use document structure to improve detection performance. Experimental results indicate that the proposed methods significantly outperform the baseline methods for unformatted table extraction.
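To make the classification view of table row identification concrete, the following is a minimal sketch of a Naïve Bayes line classifier. It is not the authors' implementation: the abstract does not list the features, so the four binary features in `line_features`, the Bernoulli variant with add-one smoothing, and the toy training lines are all assumptions chosen for illustration.

```python
import math
import re


def line_features(line):
    """Hypothetical binary features for one text line (not the paper's feature set)."""
    return {
        # A run of 2+ spaces or a tab between non-space characters suggests column gaps.
        "has_wide_gap": bool(re.search(r"\S {2,}\S|\t", line)),
        # Three or more cells when splitting on wide gaps or tabs.
        "many_columns": len(re.split(r" {2,}|\t", line.strip())) >= 3,
        "has_digit": any(ch.isdigit() for ch in line),
        # Sentence-final punctuation, including the Chinese full stop.
        "ends_with_period": line.rstrip().endswith((".", "。", "!", "?")),
    }


class BernoulliNB:
    """Bernoulli Naive Bayes over binary features with add-one smoothing."""

    def fit(self, samples, labels):
        self.classes = sorted(set(labels))
        self.names = sorted(samples[0])
        self.prior = {}
        self.p_true = {}
        for c in self.classes:
            rows = [s for s, y in zip(samples, labels) if y == c]
            self.prior[c] = len(rows) / len(labels)
            # Smoothed estimate of P(feature is true | class).
            self.p_true[c] = {
                f: (sum(r[f] for r in rows) + 1) / (len(rows) + 2)
                for f in self.names
            }
        return self

    def log_score(self, sample, c):
        # log P(c) + sum of log P(feature value | c), assuming feature independence.
        score = math.log(self.prior[c])
        for f in self.names:
            p = self.p_true[c][f]
            score += math.log(p if sample[f] else 1.0 - p)
        return score

    def predict(self, sample):
        return max(self.classes, key=lambda c: self.log_score(sample, c))


# Toy training data, invented for this sketch.
train = [
    ("Name      Age    City", "table"),
    ("Li        32     Beijing", "table"),
    ("Wang      28     Shanghai", "table"),
    ("Total\t60\t2", "table"),
    ("This paragraph describes the experiment in detail.", "text"),
    ("We collected 200 documents from the web.", "text"),
    ("The results are discussed in the next section.", "text"),
    ("Tables are a common way to present information.", "text"),
]

model = BernoulliNB().fit(
    [line_features(line) for line, _ in train],
    [label for _, label in train],
)
```

Calling `model.predict(line_features(line))` on each line of a document labels it as a table row or ordinary text. A second pass that uses document structure, as the abstract describes, would operate on these per-line labels; that step is not sketched here because the abstract gives no details of it.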

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Juanzi Li (1)
  • Jie Tang (1)
  • Qiang Song (1)
  • Peng Xu (1)

  1. Department of Computer Science and Technology, Tsinghua University, P.R. China
