Clustering Document Images Using Graph Summaries

  • Eugen Barbu
  • Pierre Héroux
  • Sébastien Adam
  • Eric Trupin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3587)


Document image classification is an important step in document image analysis. Classification results support further tasks such as indexing, understanding, or navigation in document collections. Using a document representation and an unsupervised classification method, we can group documents into clusters that are valid from the user's point of view. However, the semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results. In this paper we describe document images in terms of frequently occurring symbols. This description is created in an unsupervised manner and can be related to domain knowledge. Applying data mining techniques to a graph-based document representation, we find frequent and maximal subgraphs. For each document image, we construct a bag containing the frequent subgraphs found in it. This bag of “symbols” constitutes the description of the document. We present results obtained on a corpus of graphical document images.
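The bag-of-“symbols” idea can be illustrated with a minimal sketch. It assumes that frequent subgraph mining has already been run and that each document is given as a list of hypothetical subgraph labels (`s1`, `s2`, ...). The cosine similarity and the single-link threshold grouping are stand-ins chosen for brevity; the abstract does not say which similarity measure or clustering method is used.

```python
from collections import Counter
from math import sqrt


def bag_of_symbols(symbol_ids):
    """Count how often each frequent-subgraph label occurs in one document."""
    return Counter(symbol_ids)


def cosine_similarity(a, b):
    """Cosine similarity between two bags (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def threshold_clusters(bags, threshold):
    """Single-link grouping: merge two documents whenever their
    similarity reaches the threshold (assumed rule, for illustration)."""
    names = list(bags)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(bags[a], bags[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical input: labels of the frequent subgraphs found in each document.
documents = {
    "doc_a": ["s1", "s1", "s2"],
    "doc_b": ["s1", "s2", "s2"],
    "doc_c": ["s3", "s4"],
    "doc_d": ["s3", "s4", "s4"],
}
bags = {name: bag_of_symbols(ids) for name, ids in documents.items()}
clusters = threshold_clusters(bags, threshold=0.5)
print(clusters)
```

With this toy input, documents that share frequent subgraphs end up in the same group, while documents with no symbols in common have similarity zero and stay apart.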





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Eugen Barbu (1)
  • Pierre Héroux (1)
  • Sébastien Adam (1)
  • Eric Trupin (1)
  1. Laboratoire Perception – Systèmes – Information, FRE CNRS 2645, Université de Rouen, UFR des Sciences & Techniques, Mont-Saint-Aignan Cedex, France
