Abstract
This paper presents a complete system for categorizing handwritten documents, i.e. classifying them according to their topic. The approach first detects a set of discriminative keywords and then applies the well-known tf-idf representation to categorize the documents. Two keyword extraction strategies are explored. The first recognizes the whole document, but its performance drops sharply as the lexicon size increases. The second extracts only the discriminative keywords from the handwritten documents. This information extraction strategy relies on integrating a rejection model (or anti-lexicon model) into the recognition system. Experiments were carried out on a database of unconstrained handwritten documents drawn from an industrial application that processes incoming mail. Results show that the discriminative keyword extraction system achieves better recall/precision tradeoffs than the full recognition strategy, and that it also outperforms full recognition on the categorization task.
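As an illustration of the categorization step, the following is a minimal sketch of a tf-idf representation built over detected keywords, assuming the keyword extraction stage has already produced a list of words per document. The function names, the toy keyword lexicon, the category labels, and the nearest-neighbour cosine decision rule are all assumptions made for this example; the abstract does not specify the classifier used in the paper. Words outside the keyword lexicon are simply ignored, which loosely mirrors the effect of rejecting non-keyword words.

```python
import math
from collections import Counter

# Hypothetical keyword lexicon and labelled training documents (toy data).
VOCABULARY = {"invoice", "payment", "refund", "cancel", "contract", "subscription"}
TRAIN = [
    (["invoice", "payment", "invoice"], "billing"),
    (["payment", "refund"], "billing"),
    (["cancel", "contract"], "termination"),
    (["cancel", "subscription", "contract"], "termination"),
]


def compute_idf(docs, vocabulary):
    """Return idf[t] = log(N / df[t]) for every keyword seen in docs."""
    df = Counter()
    for doc in docs:
        for term in set(doc):
            if term in vocabulary:
                df[term] += 1
    n_docs = len(docs)
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf(doc, idf):
    """Sparse tf-idf vector; words without an idf entry are dropped."""
    tf = Counter(term for term in doc if term in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


IDF = compute_idf([doc for doc, _ in TRAIN], VOCABULARY)
TRAIN_VECTORS = [(tfidf(doc, IDF), label) for doc, label in TRAIN]


def classify(detected_keywords):
    """Return the label of the most similar training document, or None
    when no keyword from the lexicon was detected."""
    query = tfidf(detected_keywords, IDF)
    best_score, best_label = 0.0, None
    for vector, label in TRAIN_VECTORS:
        score = cosine(query, vector)
        if score > best_score:
            best_score, best_label = score, label
    return best_label
```

In this sketch, `classify(["contract", "cancel", "hello"])` ignores the out-of-lexicon word `hello` and compares the remaining keywords against the training vectors. A document in which no keyword is detected yields `None` rather than an arbitrary category.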
Cite this article
Paquet, T., Heutte, L., Koch, G. et al. A categorization system for handwritten documents. IJDAR 15, 315–330 (2012). https://doi.org/10.1007/s10032-011-0173-5
DOI: https://doi.org/10.1007/s10032-011-0173-5