Clustering Document Images Using Graph Summaries

Barbu, Eugen; Héroux, Pierre; Adam, Sébastien; Trupin, Eric

doi:10.1007/11510888_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3587)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

Document image classification is an important step in document image analysis. Classification results support further tasks such as indexing, understanding, or navigation in document collections. Using a document representation and an unsupervised classification method, we can group documents into clusters that are valid from the user's point of view. However, the semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results. In this paper we describe document images in terms of frequently occurring symbols. This description is created in an unsupervised manner and can be related to domain knowledge. Applying data mining techniques to a graph-based document representation, we find frequent and maximal subgraphs. For each document image, we construct a bag containing the frequent subgraphs found in it. This bag of “symbols” serves as the description of the document. We present results obtained on a corpus of graphical document images.
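The bag-of-symbols description can be illustrated with a minimal sketch. It assumes the frequent subgraphs have already been mined and that each one carries an identifier. The identifiers (s1, s2, s3), the document names, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration; the abstract does not state which measure or clustering algorithm the authors use.

```python
from collections import Counter
from math import sqrt


def bag_of_symbols(symbol_ids):
    """Count how often each frequent-subgraph id occurs in one document."""
    return Counter(symbol_ids)


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two bags (an assumed comparison measure)."""
    # Counter returns 0 for symbols that are absent from bag_b.
    dot = sum(count * bag_b[sym] for sym, count in bag_a.items())
    norm_a = sqrt(sum(c * c for c in bag_a.values()))
    norm_b = sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical input: for each document, the ids of the frequent subgraphs
# found in its graph representation (the mining step itself is not shown).
documents = {
    "doc_a": ["s1", "s1", "s2"],
    "doc_b": ["s1", "s2", "s2"],
    "doc_c": ["s3", "s3"],
}

bags = {name: bag_of_symbols(ids) for name, ids in documents.items()}

for first, second in [("doc_a", "doc_b"), ("doc_a", "doc_c")]:
    score = cosine_similarity(bags[first], bags[second])
    print(f"{first} vs {second}: {score:.2f}")
```

Documents that share many frequent subgraphs score close to 1 and documents with no symbols in common score 0, so a pairwise similarity of this kind could feed any standard unsupervised clustering method.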

Author information

Authors and Affiliations

Laboratoire Perception – Systèmes – Information, FRE CNRS 2645, Université de Rouen, UFR des Sciences & Techniques, Place Emile Blondel, 76821, Mont-Saint-Aignan Cedex, France
Eugen Barbu, Pierre Héroux, Sébastien Adam & Eric Trupin

Editor information

Editors and Affiliations

Institute of Computer Vision and applied Computer Sciences, IBaI, Germany
Petra Perner
Institute of Media and Information Technology, Chiba University, Japan
Atsushi Imiya

About this paper

Cite this paper

Barbu, E., Héroux, P., Adam, S., Trupin, E. (2005). Clustering Document Images Using Graph Summaries. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_20

DOI: https://doi.org/10.1007/11510888_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26923-6
Online ISBN: 978-3-540-31891-0
eBook Packages: Computer Science, Computer Science (R0)
