Abstract
This paper presents a semi-supervised document image classification system intended for integration into commercial document reading software.
The system acts as an annotation aid. Given a set of unknown document images supplied by a human operator, it computes hypotheses for grouping images that share the same physical layout and proposes these groups to the operator. The operator can then correct or validate them, with the objective of obtaining homogeneous groups of images. These groups are then used to train the supervised document image classifier. The system contains N feature spaces, each with its own metric function for computing the similarity between two points of that space. After projecting each image into these N feature spaces, the system builds N hierarchical agglomerative classification (hac) trees, one per feature space. The grouping proposals produced by the various hac trees are then compared and merged. Results, evaluated by the number of corrections made by the operator, are presented on different image sets.
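The multi-view grouping step described above can be illustrated with a minimal sketch. The abstract does not specify the linkage criterion, the rule for cutting each tree, or how proposals are merged, so everything below is an assumption made for illustration: single linkage, a fixed distance threshold, and a conservative merge that keeps two images together only when every view groups them together. The names `hac_groups`, `merge_views`, and `propose_groups` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def hac_groups(points, metric, threshold):
    """Agglomerative clustering of one view (assumed: single linkage).

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`; the remaining clusters are the view's proposal.
    """
    clusters = [{i} for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the two closest members.
            d = min(metric(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters


def merge_views(groupings, n_images):
    """Assumed merge rule: two images stay together only if all views agree."""
    merged = {}
    for i in range(n_images):
        # Signature of image i: the index of its cluster in each view.
        key = tuple(
            next(k for k, group in enumerate(groups) if i in group)
            for groups in groupings
        )
        merged.setdefault(key, set()).add(i)
    return sorted(merged.values(), key=min)


def propose_groups(views, threshold):
    """`views` is a list of (points, metric) pairs, one per feature space."""
    n_images = len(views[0][0])
    groupings = [hac_groups(pts, metric, threshold) for pts, metric in views]
    return merge_views(groupings, n_images)


# Toy data: five images described in two hypothetical 1-D feature spaces.
def abs_diff(x, y):
    return abs(x - y)

view_a = [0.0, 0.1, 0.2, 5.0, 5.1]
view_b = [1.0, 1.1, 9.0, 3.0, 3.1]
proposal = propose_groups([(view_a, abs_diff), (view_b, abs_diff)], threshold=0.5)
print(proposal)
```

In this toy run, the first view groups images 0, 1, and 2 together, while the second view isolates image 2, so the merged proposal keeps image 2 apart. An operator would then validate or correct such proposals, as the abstract describes.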
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Carmagnac, F., Héroux, P., Trupin, É. (2004). Multi-view hac for Semi-supervised Document Image Classification. In: Marinai, S., Dengel, A.R. (eds) Document Analysis Systems VI. DAS 2004. Lecture Notes in Computer Science, vol 3163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28640-0_18
DOI: https://doi.org/10.1007/978-3-540-28640-0_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23060-1
Online ISBN: 978-3-540-28640-0
eBook Packages: Springer Book Archive