Metrics for evaluating performance in document analysis: application to tables

  • Ana Costa e Silva
Original Paper

Abstract

Is an algorithm with high precision and recall at identifying table parts also good at locating tables? Several document analysis tasks require merging or splitting certain document elements to form others. The suitability of the commonly used precision and recall metrics for such division/aggregation tasks is debatable, since they assume that the granularity of the items is the same at input and at output. We propose a new pair of evaluation metrics that better suit the needs of document analysis and show their application to several table tasks. In the process, we present a number of robust table location algorithms, with which we draw a road map for creating Hidden Markov Models for the task.
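
The granularity mismatch raised in the abstract can be illustrated with a minimal sketch. It is not the pair of metrics proposed in the paper, which the abstract does not define. The document layout, the detector output, and the helper names `line_precision_recall` and `group_runs` are all hypothetical, and "locating a table" is assumed here to mean recovering its exact line span.

```python
def line_precision_recall(true_lines, pred_lines):
    """Precision and recall computed over individual lines (the table parts)."""
    tp = len(true_lines & pred_lines)
    precision = tp / len(pred_lines) if pred_lines else 0.0
    recall = tp / len(true_lines) if true_lines else 0.0
    return precision, recall


def group_runs(lines):
    """Merge consecutive line numbers into (start, end) spans (the tables)."""
    spans = []
    for n in sorted(lines):
        if spans and n == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], n)  # extend the current span
        else:
            spans.append((n, n))  # start a new span
    return spans


# Hypothetical document: two ground-truth tables on lines 0-9 and 20-29.
true_tables = [(0, 9), (20, 29)]
true_lines = {n for start, end in true_tables for n in range(start, end + 1)}

# Hypothetical detector: labels every table line correctly except one
# line in the middle of each table.
pred_lines = true_lines - {5, 25}

precision, recall = line_precision_recall(true_lines, pred_lines)
pred_tables = group_runs(pred_lines)
exact_matches = [t for t in pred_tables if t in true_tables]

print(f"line-level precision={precision:.2f}, recall={recall:.2f}")
print(f"predicted tables: {pred_tables}")
print(f"tables located exactly: {len(exact_matches)} of {len(true_tables)}")
```

In this toy case the line-level scores are high, yet each table is split in two and neither is located exactly, which is the kind of discrepancy that part-level precision and recall cannot reveal.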

Keywords

Performance evaluation, Document analysis, Table processing, Metrics

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  1. The Informatics Forum, Centre for Intelligent Systems and their Applications, School of Informatics, The University of Edinburgh, Edinburgh, Scotland, UK
