Abstract
In this paper, we report our effort in building a multilingual structure, the Cursive and Language Adaptive Methodology (CALAM), to create, annotate, and validate linguistic datasets. CALAM provides a way to fetch and retrieve information in a scientific and systematic manner through the design and development of an annotated corpus of handwritten text images. It is a useful tool for annotating multilingual handwritten image datasets (Hindi, English, Urdu, etc.). The annotation is not limited to grammatical tagging; structural markup is also performed. Handwritten text images are annotated hierarchically, from the handwritten form down to segmented lines, words, and components. The component-level markup is useful for finding strokes and listing ligatures in Urdu. Along with a hierarchical access structure, CALAM provides indexing, insertion, searching, and deletion of words and phrases in handwritten form. Apart from dataset fetching and retrieval, it also automatically generates an XML-tagged file for each annotated handwritten text image in the dataset.
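The hierarchy described above (form, lines, words, components) and the per-image XML output can be illustrated with a minimal Python sketch. This is not CALAM's actual schema or implementation: the class names, the tag and attribute names (`form`, `line`, `word`, `component`, `x`, `y`, `w`, `h`, `text`, `label`), the bounding-box convention, and the sample file name are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention: (x, y, width, height) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class Component:
    bbox: BBox
    label: str  # hypothetical label, e.g. a stroke or ligature name


@dataclass
class Word:
    bbox: BBox
    text: str
    components: List[Component] = field(default_factory=list)


@dataclass
class Line:
    bbox: BBox
    words: List[Word] = field(default_factory=list)


@dataclass
class Form:
    image: str      # hypothetical image file name
    language: str
    lines: List[Line] = field(default_factory=list)


def _bbox_attrs(bbox: BBox) -> dict:
    """Turn a bounding box into string-valued XML attributes."""
    x, y, w, h = bbox
    return {"x": str(x), "y": str(y), "w": str(w), "h": str(h)}


def form_to_xml(form: Form) -> ET.Element:
    """Serialize one annotated form into a nested XML tree."""
    root = ET.Element("form", image=form.image, lang=form.language)
    for line in form.lines:
        line_el = ET.SubElement(root, "line", _bbox_attrs(line.bbox))
        for word in line.words:
            word_el = ET.SubElement(line_el, "word", _bbox_attrs(word.bbox))
            word_el.set("text", word.text)
            for comp in word.components:
                comp_el = ET.SubElement(word_el, "component", _bbox_attrs(comp.bbox))
                comp_el.set("label", comp.label)
    return root


def find_words(root: ET.Element, text: str) -> List[ET.Element]:
    """Search the hierarchy for word elements whose transcription matches."""
    return [w for w in root.iter("word") if w.get("text") == text]


# Illustrative data only; not taken from the CALAM corpus.
sample = Form(
    image="form_0001.png",
    language="en",
    lines=[
        Line(
            bbox=(10, 20, 400, 40),
            words=[
                Word(
                    bbox=(10, 20, 90, 40),
                    text="hello",
                    components=[
                        Component(bbox=(10, 20, 40, 40), label="c1"),
                        Component(bbox=(50, 20, 50, 40), label="c2"),
                    ],
                ),
                Word(bbox=(110, 20, 80, 40), text="world"),
            ],
        )
    ],
)

root = form_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Nesting the elements mirrors the hierarchical access structure, so a search such as `find_words(root, "hello")` walks the same tree that the XML file records.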
Acknowledgments
This work is financially supported by the Department of Science and Technology, Government of Rajasthan.
Copyright information
© 2015 Springer India
About this paper
Cite this paper
Choudhary, P., Nain, N. (2015). CALAM: Linguistic Structure to Annotate Handwritten Text Image Corpus. In: Jain, L., Behera, H., Mandal, J., Mohapatra, D. (eds) Computational Intelligence in Data Mining - Volume 2. Smart Innovation, Systems and Technologies, vol 32. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2208-8_41
DOI: https://doi.org/10.1007/978-81-322-2208-8_41
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2207-1
Online ISBN: 978-81-322-2208-8
eBook Packages: Engineering, Engineering (R0)