g-DICE: graph mining-based document information content exploitation

  • K. C. Santosh
Original Paper

Abstract

In this paper, we present a technique for extracting document information content (i.e. text fields) via graph mining. Real-world users first provide a set of key text fields from the document image that they consider important. These fields are used to initialise a graph in which nodes are labelled with the field names, along with other features such as size, type and number of words, and edges are attributed with the relative positioning between the nodes. This attributed relational graph is then used to mine similar graphs from document images. Each time similar graphs are extracted, they are used to update the initial graph, and iterating this process produces a graph model. Graph models can then be employed in the absence of users. We have validated the proposed technique and evaluated its scientific impact on a real-world industrial problem, obtaining 86.64% precision and 90.80% recall when considering all zones, viz. header, body and footer. More specifically, the proposed technique is well suited for table processing (i.e. extracting repeated patterns from the table), and it outperforms the state-of-the-art method by more than 3%.
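The attributed relational graph described above can be illustrated with a minimal sketch. It assumes that each key text field comes with a bounding box and that relative positioning is reduced to four coarse directions computed from box centres. The names `Field`, `relative_position` and `build_arg`, the attribute names and the sample fields are hypothetical and are not taken from the paper, which may encode spatial relations differently.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Field:
    """A user-selected key text field (attribute names are illustrative)."""
    name: str      # field label, e.g. "invoice_no"
    kind: str      # content type, e.g. "numeric" or "alphabetic"
    n_words: int   # number of words in the field
    box: tuple     # (x_min, y_min, x_max, y_max), with y growing downward


def centre(box):
    """Return the centre point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def relative_position(a, b):
    """Coarse position of b relative to a (an assumed four-way encoding)."""
    (ax, ay), (bx, by) = centre(a.box), centre(b.box)
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "below" if dy > 0 else "above"


def build_arg(fields):
    """Build node and edge attribute maps for the selected fields."""
    nodes = {
        f.name: {
            "kind": f.kind,
            "n_words": f.n_words,
            "size": (f.box[2] - f.box[0], f.box[3] - f.box[1]),  # width, height
        }
        for f in fields
    }
    # One directed edge per ordered pair, attributed with the spatial relation.
    edges = {
        (a.name, b.name): relative_position(a, b)
        for a, b in permutations(fields, 2)
    }
    return nodes, edges


# Hypothetical fields from an invoice-like page.
fields = [
    Field("date", "alphanumeric", 1, (50, 20, 150, 40)),
    Field("invoice_no", "numeric", 1, (400, 20, 500, 40)),
    Field("total", "numeric", 1, (400, 700, 500, 720)),
]
nodes, edges = build_arg(fields)
```

In this sketch the graph is fully connected, so every pair of selected fields carries a relation in both directions. Mining similar graphs in other documents and updating the model iteratively, as the abstract describes, is outside the scope of the example.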

Keywords

Document information content Table processing Key text fields Spatial relations Attributed relational graph Graph mining 

Notes

Acknowledgments

The author would like to thank Prof. Abdel Belaïd for his suggestions and his administrative assistance on the project assigned by ITESOFT, France. The author would also like to thank the National Institutes of Health (NIH) Fellows Editorial Board for their editorial assistance.

Compliance with ethical standards

Conflict of interest

None declared.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Department of Computer Science, The University of South Dakota, Vermillion, USA