g-DICE: graph mining-based document information content exploitation

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

In this paper, we present a technique for extracting document information content (i.e. text fields) via graph mining. Real-world users first provide a set of key text fields from the document image that they consider important. These fields are used to initialise a graph in which nodes are labelled with the field names, together with other features such as size, type and number of words, and edges are attributed with the relative positioning between the nodes. This attributed relational graph is then used to mine similar graphs from document images, and each extracted graph is used to update the initial graph iteratively, producing a graph model. Graph models can then be employed in the absence of users. We have validated the proposed technique and evaluated it on a real-world industrial problem, where it achieves 86.64 % precision and 90.80 % recall when all zones, viz. header, body and footer, are considered. More specifically, the proposed technique is well suited for table processing (i.e. extracting repeated patterns from the table), and it outperforms the state-of-the-art method by approximately 3 % or more.
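
To make the graph initialisation step concrete, the following is a minimal Python sketch of an attributed relational graph built from user-selected fields. It is an illustration under stated assumptions, not the paper's implementation: the `Field` class, the three content types, and the four coarse relations (left, right, above, below) are hypothetical choices, since the abstract names the node features (size, type, number of words) and the edge attribute (relative positioning) but does not define them.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Field:
    """A user-selected text field (hypothetical structure, not from the paper)."""
    label: str   # field name supplied by the user, e.g. "invoice_no"
    text: str    # recognised text content
    x: float     # bounding box, origin at the top-left of the page
    y: float
    w: float
    h: float


def content_type(text):
    """Assumed three-way content type; the paper may use other categories."""
    stripped = text.replace(" ", "")
    if stripped.isdigit():
        return "numeric"
    if stripped.isalpha():
        return "alphabetic"
    return "alphanumeric"


def node_attributes(field):
    """Node features named in the abstract: size, type and number of words."""
    return {
        "size": (field.w, field.h),
        "type": content_type(field.text),
        "n_words": len(field.text.split()),
    }


def relative_position(a, b):
    """Coarse position of b relative to a, from box centres (an assumption)."""
    dx = (b.x + b.w / 2) - (a.x + a.w / 2)
    dy = (b.y + b.h / 2) - (a.y + a.h / 2)
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "below" if dy > 0 else "above"


def build_arg(fields):
    """Return (nodes, edges): nodes keyed by label, directed edges keyed by label pair."""
    nodes = {f.label: node_attributes(f) for f in fields}
    edges = {
        (a.label, b.label): relative_position(a, b)
        for a, b in permutations(fields, 2)
    }
    return nodes, edges


fields = [
    Field("invoice_no", "INV 2041", x=400, y=50, w=120, h=20),
    Field("total", "1250", x=400, y=600, w=60, h=20),
]
nodes, edges = build_arg(fields)
```

Here `nodes["total"]["type"]` is `"numeric"` and `edges[("invoice_no", "total")]` is `"below"`. Keeping edges directed means each pair stores both viewpoints, which is one simple way to compare such a graph against candidate graphs from other document images; the mining and iterative update steps described in the abstract are not sketched here.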

Notes

  1. http://www.itesoft.com.

Acknowledgments

The author would like to thank Prof. Abdel Belaïd for his suggestions and his administrative assistance on the project assigned by ITESOFT, France. The author would also like to thank the National Institutes of Health (NIH) Fellows Editorial Board for their editorial assistance.

Author information

Corresponding author

Correspondence to K. C. Santosh.

Ethics declarations

Conflict of interest

None declared.

About this article

Cite this article

Santosh, K.C. g-DICE: graph mining-based document information content exploitation. IJDAR 18, 337–355 (2015). https://doi.org/10.1007/s10032-015-0253-z

  • DOI: https://doi.org/10.1007/s10032-015-0253-z
