
Clustering Document Images Using Graph Summaries

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3587)


Abstract

Document image classification is an important step in document image analysis. Based on classification results, we can tackle other tasks such as indexing, understanding, or navigation in document collections. Using a document representation and an unsupervised classification method, we can group documents that, from the user's point of view, constitute valid clusters. The semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results. In this paper we describe document images based on frequently occurring symbols. This document description is created in an unsupervised manner and can be related to the domain knowledge. Applying data mining techniques to a graph-based document representation, we find frequent and maximal subgraphs. For each document image, we construct a bag containing the frequent subgraphs found in it. This bag of “symbols” represents the description of a document. We present results obtained on a corpus of graphical document images.
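The bag-of-symbols description outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the frequent subgraphs have already been mined and reduced to plain string labels, and the names `frequent_symbols`, `bag_of_symbols`, `cosine`, the toy documents, and the `min_support` threshold are all hypothetical. Cosine similarity is used here only as one plausible way to compare bags before clustering; the abstract does not state which measure the paper uses.

```python
from collections import Counter
from math import sqrt


def frequent_symbols(docs, min_support):
    """Return the symbols that occur in at least `min_support` documents."""
    support = Counter()
    for symbols in docs.values():
        support.update(set(symbols))  # count each symbol once per document
    return {s for s, n in support.items() if n >= min_support}


def bag_of_symbols(symbols, vocabulary):
    """Describe one document as a bag (multiset) of its frequent symbols."""
    return Counter(s for s in symbols if s in vocabulary)


def cosine(a, b):
    """Cosine similarity between two bags; 0.0 if either bag is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: each label stands in for a mined subgraph.
docs = {
    "plan_a": ["door", "door", "window", "stairs"],
    "plan_b": ["door", "window", "window"],
    "schematic": ["resistor", "resistor", "capacitor"],
}

vocab = frequent_symbols(docs, min_support=2)
bags = {name: bag_of_symbols(symbols, vocab) for name, symbols in docs.items()}
```

With these toy inputs, only "door" and "window" reach the support threshold, so the two plans receive overlapping bags while the schematic's bag is empty. A clustering method could then operate on the pairwise similarities between bags.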

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Barbu, E., Héroux, P., Adam, S., Trupin, E. (2005). Clustering Document Images Using Graph Summaries. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_20

  • DOI: https://doi.org/10.1007/11510888_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26923-6

  • Online ISBN: 978-3-540-31891-0

  • eBook Packages: Computer Science (R0)
