Review

The availability of databases is a fundamental requirement for development and evaluation in all scientific research domains. Standard datasets provide a platform for comparing and evaluating different techniques on the same ground, thus removing any possible bias [1, 2]. Collecting samples for database development is naturally cumbersome and tedious, as it involves gathering the widest possible variety of samples from many participants. Standard databases not only spare researchers the effort of compiling their own data but also give them an opportunity for an objective, comparative performance evaluation of their systems. Benchmark construction is not just the accumulation of samples but an organized process of selecting and rejecting the samples to be included in the database. Like any other scientific domain, the document analysis and recognition (DAR) community has also developed a large number of document databases. The most researched and significant task in document analysis and recognition is handwriting recognition. Naturally, most of the standard databases developed by the document recognition community are handwritten databases.

The development of handwritten databases is as old as the problem of document analysis and recognition itself. The development of standard databases started to receive notable attention in the early 1990s, and the process still continues. The most important and widely used handwritten databases include IAM [2–4], RIMES [5], NIST [6], MNIST [7], CENPARMI [8–13], CEDAR [14], UNIPEN [15], ETL9 [16] and PE92 [17]. Although most of these databases have been developed using text in languages based on the Latin alphabet, development of databases in Chinese [18], Korean [17], Arabic [13, 19–22], Farsi [10, 12, 23, 24] and Indian scripts [25] is also on the rise. The trend of multi-script handwritten databases [26, 27] has also been observed in the last few years. These handwritten databases comprise a variety of samples including handwritten digits [6, 7, 13, 21, 28], characters [14, 17, 29–31], words [13, 14, 19, 21, 23, 24, 28], or complete sentences [3, 5, 18, 26, 27]. A step beyond benchmark construction is the organization of evaluation campaigns and competitions, which allow researchers to compare their systems under the same experimental setups.

This paper is intended to provide a comprehensive survey of the handwritten databases developed during the last two decades. We not only discuss the statistics of these databases but also present a comparative analysis on different dimensions including the size of database, number of contributors, textual content of the database, data acquisition mode (online or offline), writing script and the tasks which could be evaluated on a given database. This study is likely to be helpful for researchers in selecting the most appropriate databases for evaluation of their developed systems. We first discuss the basics of handwriting benchmarks in Section 2 followed by a detailed review of the well-known handwritten databases, their structure and usage in Section 3. Section 4 provides an overview of the evaluation campaigns and competitions organized using these databases while the last section concludes the paper with a discussion on future trends on the subject.

Handwriting benchmarks: basics

Research in handwriting recognition and related problems has been carried out in online as well as offline domains. Benchmarks have, therefore, been developed both for offline and online analysis of handwriting. Offline samples of handwriting are collected by having individuals write on paper with a typical writing instrument (a pen or a pencil) and digitizing the paper documents using a scanner. Online databases of handwriting are produced by requiring the subjects to write directly on a digitizing tablet or a similar device, using a stylus or a finger. In addition to the writing strokes in terms of x-y coordinates of the pen position, online handwriting also contains additional information including pen pressure, writing speed, stroke order, etc. Offline datasets of handwritten text may comprise alphanumeric characters, isolated words, or complete paragraphs. Generally, these databases are produced by requiring the subjects to fill in standardized forms with either a prescribed or an arbitrary text. These forms are then scanned into a digital format. Online handwriting databases also comprise isolated characters, words, or sentences. Since the collection of online data requires the subjects to produce their samples directly on digitizing devices, online data collection is generally considered relatively easier but naturally requires specialized hardware for acquisition of samples.
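To make the difference between the two acquisition modes concrete, the following is a minimal sketch of how an online sample could be represented in memory. The class and field names (`PenPoint`, `Stroke`, `OnlineSample`) are illustrative assumptions and do not correspond to the storage format of any database discussed in this survey. The sketch shows that stroke order and writing speed, which are lost in a scanned offline image, can be derived directly from the time-stamped pen trajectory.

```python
from dataclasses import dataclass
from math import hypot
from typing import List


@dataclass
class PenPoint:
    """One sampled pen position (hypothetical layout, not a database format)."""
    x: float
    y: float
    t: float               # timestamp in seconds
    pressure: float = 0.0  # optional pen pressure reported by the device


@dataclass
class Stroke:
    """Points recorded between one pen-down and the following pen-up."""
    points: List[PenPoint]

    def length(self) -> float:
        # Sum of Euclidean distances between consecutive points.
        return sum(
            hypot(b.x - a.x, b.y - a.y)
            for a, b in zip(self.points, self.points[1:])
        )

    def duration(self) -> float:
        if len(self.points) < 2:
            return 0.0
        return self.points[-1].t - self.points[0].t

    def mean_speed(self) -> float:
        # Average writing speed along the stroke (distance units per second).
        d = self.duration()
        return self.length() / d if d > 0 else 0.0


@dataclass
class OnlineSample:
    """An online handwriting sample; list order of strokes is the stroke order."""
    label: str
    strokes: List[Stroke]


stroke = Stroke([PenPoint(0.0, 0.0, 0.0), PenPoint(3.0, 4.0, 0.5)])
sample = OnlineSample(label="l", strokes=[stroke])
```

An offline sample, by contrast, would only be the scanned image of the finished writing, so none of the quantities computed by `duration` or `mean_speed` could be recovered from it.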

The next step after data collection is the labeling of data to produce the ground truth. The ground truth associated with a database determines the tasks that could be evaluated using the database. Labeling is generally carried out at character, word, or line levels to support the traditional preprocessing, segmentation and recognition tasks. In addition, some databases also support evaluation of tasks like document layout analysis, word spotting, writer demographics classification, writer identification and writer verification.

The next section presents a detailed discussion on the handwritten databases developed during the last two decades.

Handwriting benchmarks survey: structure and usage

A large number of handwritten benchmark datasets supporting the evaluation of a variety of preprocessing, segmentation and recognition tasks have been developed over the years. These databases can be categorized on different dimensions including the data acquisition method (online or offline), script, size, or the types of tasks supported. In our discussion, we have grouped the databases according to the script of the writing samples. The groups are the Roman/Latin script, Chinese, Japanese and Korean (CJK) writings covering East Asian languages, Arabic and Arabic-like scripts and different Indian scripts. The handwritten databases developed in each of these scripts are discussed in the following.

Databases in the Roman script

The Roman or Latin script is the most widely used writing system based on the letters of classical Latin alphabet. With minor variations, Roman script covers English, French, German, Spanish, Portuguese, Swedish and Dutch languages. Some other languages have also migrated to this script, Malaysian and Indonesian being the most notable of these. Consequently, a significant proportion of the handwritten databases comprise text in the Roman script. The following sections discuss in detail the well-known handwriting databases in the Roman script.

IAM databases

The IAM databases are easily the most widely used collections of handwritten samples employed for a variety of segmentation and recognition tasks. A number of offline and online databases have been developed under the IAM umbrella as discussed in the following.

IAM-DB: IAM handwriting database

The IAM Handwriting Database [2, 3] comprises handwritten samples in English which can be used to evaluate systems for text segmentation, handwriting recognition, writer identification and writer verification. The database is based on the Lancaster-Oslo/Bergen Corpus and comprises forms where the contributors copied a given text in their natural unconstrained handwriting. Each form was subsequently scanned at 300 dpi and saved as a gray level (8-bit) PNG image. A complete filled form, sample lines of text and some words extracted from a sample form in the database are illustrated in Fig. 1. The IAM Handwriting Database 3.0 includes contributions from 657 writers making a total of 1539 handwritten pages comprising 5685 sentences, 13,353 text lines and 115,320 words. The database is labeled at sentence, line and word levels. The database has been widely used in word spotting [32–35], writer identification [36–40], handwritten text segmentation [41–43] and offline handwriting recognition [44–47].

Fig. 1. Samples of handwritten text from the IAM database [3]

IAM On-Line Handwriting Database (IAM-OnDB)

IAM-OnDB [4] is a collection of online handwritten samples written on a whiteboard and acquired with the E-Beam System. Data is stored in XML format which, in addition to the transcription of text, also contains information on writers and writer demographics. The database comprises 221 writers contributing a total of more than 1700 forms with 13,049 labeled text lines and 86,272 word instances from a dictionary of 11,059 words. In addition to recognition of online handwriting [48, 49], the database has also been employed for online writer identification [50] and gender classification from handwriting [51].

IAM Online Document Database (IAM on-Do)

The IAM on-Do [52] is a relatively new database of online handwritten documents containing text, drawings, diagrams, formulas, tables, lists and markings as indicated in Fig. 2. The database can be employed for document layout analysis and different segmentation and recognition tasks. The database consists of 1000 documents produced by approximately 200 writers. Few constraints were imposed on the writers while creating the documents. Nonetheless, the database has a stable distribution of the different content types and presents a collection of samples close to those encountered in real-world scenarios. The database has been employed for mode (content type) detection [53], keyword spotting [54] and classification of text/non-text objects [55].

Fig. 2. A sample of the IAM on-Do database [52]

IAM Historical Document Database (IAM-HistDB)

The IAM-HistDB is a repository comprising handwritten historical manuscript images together with ground truth data. The IAM-HistDB currently includes Saint Gall Database [56] of ninth century containing manuscripts written by a single writer in Carolingian script. The original manuscript is housed at the Abbey Library of Saint Gall, Switzerland. The manuscript images are made available online by the E-codices (Virtual Manuscript Library of Switzerland) project and a text edition was attached at page-level by the Monumenta project. IAM additionally added binarized and normalized text line images to the manuscript data. Altogether, the manuscript data contains page images (jpeg, 300 dpi), binarized and normalized text line images and text edition at page-level (word spelling, capitalization, punctuations, etc.). These images have been employed for text-line segmentation [57, 58], binarization [59, 60], keyword spotting [34, 61] and handwriting recognition [62].

RIMES

RIMES [5, 63] is a database representative of an industrial application. The main idea behind this database was to collect handwritten samples similar to those that individuals send to companies. Each contributor was assigned a fictitious identity and up to five different scenarios from a set of nine themes. These themes included real-world scenarios like ‘damage declaration’ or ‘modification of contract’. The subjects were required to compose a letter for a given scenario using their own words and layout on white paper using black ink. A total of 1300 volunteers contributed to data collection providing 12,723 pages corresponding to 5605 mails. Each mail contains two to three pages including the letter written by the contributor, a form with information about the letter and an optional fax sheet. The pages were scanned, and the complete database was annotated to support evaluation of tasks like document layout analysis [64, 65], mail classification [66], handwriting recognition [67–72] and writer recognition [38, 73].

NIST: handwriting sample image databases

The National Institute of Standards and Technology, NIST, developed a series of databases [6] of handwritten characters and digits supporting tasks like isolation of fields, detection and removal of boxes in forms, character segmentation and recognition. A sample form from the database is illustrated in Fig. 3. The form comprises boxes containing writer information, 28 boxes for numbers, 2 boxes for alphabetic characters and 1 box for a paragraph of text. The NIST Special Database 1 comprised samples contributed by 2100 writers. The latest version of the database, the Special Database 19, comprises handwritten forms of 3600 writers with 810,000 isolated character images along with ground truth information. This database has been widely employed in a variety of handwritten digit [74–77] and character recognition systems [78–81].

Fig. 3. A sample filled form from the NIST database [6]

MNIST: a database of handwritten digits

MNIST is a large collection of handwritten digits [7] with a training set of 60,000 and a test set of 10,000 samples. MNIST is a subset of the NIST database discussed earlier and is composed of samples from the NIST Special Database 3 (SD-3) and Special Database 1 (SD-1). Initially, SD-3 was proposed as the training set and SD-1 as the test set. However, samples in SD-3 were contributed by employees of the Census Bureau while those of SD-1 were written by high school students. As a result, SD-1 is more challenging to recognize than SD-3. To ensure a uniform distribution of samples from SD-1 and SD-3 in the training and test sets, the MNIST database was compiled with a training set of 30,000 images from SD-1 and 30,000 from SD-3. In a similar fashion, the test set comprised 5000 samples each from the SD-1 and SD-3 databases. The database has been extensively employed in a number of digit recognition systems [82–87].
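The composition described above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary `MNIST_COMPOSITION` and the helper functions are hypothetical names introduced here, and the counts are those quoted in the paragraph above.

```python
# Sample counts per source, as stated in the text (names are illustrative).
MNIST_COMPOSITION = {
    "train": {"SD-1": 30_000, "SD-3": 30_000},
    "test": {"SD-1": 5_000, "SD-3": 5_000},
}


def split_sizes(composition):
    """Total number of samples in each split."""
    return {split: sum(parts.values()) for split, parts in composition.items()}


def source_share(composition, split, source):
    """Fraction of a split that comes from the given NIST source."""
    parts = composition[split]
    return parts[source] / sum(parts.values())


sizes = split_sizes(MNIST_COMPOSITION)
train_sd1_share = source_share(MNIST_COMPOSITION, "train", "SD-1")
test_sd1_share = source_share(MNIST_COMPOSITION, "test", "SD-1")
```

Both splits draw half of their samples from each source, so the harder SD-1 writing and the cleaner SD-3 writing are equally represented in training and testing.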

CEDAR databases

The Center of Excellence for Document Analysis and Recognition (CEDAR), at the State University of New York at Buffalo, has developed a number of handwritten databases [14] including handwritten words, ZIP codes, digits and alphanumeric characters. These databases were mainly intended to support research in automatic processing of postal addresses on envelopes. The samples contain 5632 city words, 4938 state words and 9454 ZIP codes. This makes a total of 27,835 alphanumeric characters segmented from address blocks and 21,179 digits segmented from ZIP codes. The words in the database are divided into separate subsets for training and test. This database has been used for the evaluation of a number of systems including handwriting segmentation [88, 89], cursive digit recognition [74, 90–92], character recognition [90, 93, 94], word segmentation [95] and word recognition [96–98].

IRONOFF: the IRESTE on/off dual handwriting database

The Institut de Recherche et d’Enseignement Supérieur aux Techniques de l’Electronique, IRESTE, developed a dual on/off database [99], named IRONOFF. The database comprises handwritten samples of French writers including characters, digits and words. The contributors were required to fill in forms with predefined boxes, and the ground truth information and the filled forms were later inspected by human operators. Each contributor filled in three types of forms, named B, C and D. Form B includes the lower- and uppercase letters of the alphabet, digits, the Euro symbol and the frequently occurring strings in French checks. Forms C and D comprise cursive words in French. The database contains a total of 1000 forms with 32,000 isolated characters and 50,000 cursive words. The online version of the samples is stored in UNIPEN format at a sampling rate of 100 points/second. The database has been employed in a variety of recognition tasks [100–103] as well as online writer identification [104].

The RODRIGO database

RODRIGO [105] is a very large database of historical manuscript samples in Spanish. The RODRIGO database was generated from an old manuscript (dated 1545) written in old Castilian (Spanish) by a single author. The writing style is mainly influenced by the Gothic style. The database spans 853 pages and is further divided into 307 chapters describing chronicles from Spanish history. Each page contains a single, well-separated text block of calligraphic handwriting. The complete manuscript was digitized by experts of the Spanish Ministry of Culture at 300 dpi in true color. The database can be employed for research on historical manuscripts [106–108].

Indonesian handwritten text database

A database of Indonesian handwritten text [109] was compiled to support recognition and segmentation tasks. The database was developed from writing samples contributed by college students making a total of 200 scanned forms. These forms comprise isolated and cursive digits, isolated upper- and lowercase characters and words and can be employed for evaluation of a number of recognition tasks.

Database for bank-check processing

A recent database for evaluation of check processing and word recognition systems is presented in [110]. The database comprises cursive words in English, courtesy amounts and signatures. The ground truth of the database is developed in XML and, in addition to transcription of text, also contains the identity of the contributing writers. The database can be employed for recognition of words as well as verification of signatures.

IBM UB database

The IBM UB database [111] developed at the Center for Unified Biometrics and Sensors (CUBS) at the University at Buffalo is a multi-lingual online/offline database of handwritten samples. The writing samples include paragraphs of free text, filled forms, words, isolated characters and symbols. Writing samples were collected on IBM’s CrossPad with an electronic pen which simultaneously produced ink on the paper and captured the trajectory of the pen. The online data is available in the InkML format while the offline images are scanned as ‘png’ files. The database is divided into two parts, IBM UB 1 and IBM UB 2. The IBM UB 1 comprises cursive handwritten texts in English with more than 6500 online pages collected from 43 writers and around 6000 offline pages contributed by 41 writers. The IBM UB 2 contains short phrases, digits and isolated characters in French produced by 200 writers. The database has been used for online handwriting recognition [112] and writer identification tasks [113].

CVL Database

CVL [27] is a database of handwritten samples supporting handwriting recognition, word spotting and writer recognition. The database comprises seven different handwritten texts, one in German and six in English. A total of 310 volunteers contributed to data collection with 27 authors producing 7 and 283 writers providing 5 pages each. The ground truth data is available in XML format which includes transcription of text, the bounding box of each word and the identity of writer. The database has been used for writer recognition and retrieval [114] and can also be employed for other recognition tasks. A sample image from the database is shown in Fig. 4.

Fig. 4. Sample handwritten text from CVL Database
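As an illustration of how word-level ground truth of the kind described for CVL (transcription, word bounding boxes and writer identity) might be consumed, the sketch below parses a small XML snippet. The element and attribute names (`page`, `word`, `writer`, `x`, `y`, `width`, `height`) are assumptions made for this example only and are not the actual CVL schema, which should be taken from the database documentation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical ground-truth snippet; the real CVL XML layout may differ.
SAMPLE_XML = """
<page writer="0001">
  <word text="Imagine" x="120" y="85" width="210" height="60"/>
  <word text="a" x="350" y="90" width="40" height="50"/>
</page>
"""


@dataclass
class WordAnnotation:
    text: str
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def parse_page(xml_string: str) -> Tuple[str, List[WordAnnotation]]:
    """Return the writer identity and the word annotations of one page."""
    root = ET.fromstring(xml_string.strip())
    writer_id = root.attrib["writer"]
    words = [
        WordAnnotation(
            text=w.attrib["text"],
            bbox=tuple(int(w.attrib[k]) for k in ("x", "y", "width", "height")),
        )
        for w in root.iter("word")
    ]
    return writer_id, words


writer_id, words = parse_page(SAMPLE_XML)
transcription = " ".join(w.text for w in words)
```

With annotations in this form, the same page can serve several tasks: the transcription for handwriting recognition, the bounding boxes for word spotting and the writer identity for writer recognition.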

In addition to text, a database of handwritten digit strings contributed by 303 students has also been compiled [115]. Each writer provided 26 different digit strings of different lengths making a total of 7800 samples. Isolated digits were extracted from the database to form a separate dataset, the CVL Single Digit Dataset. The Single Digit Dataset comprises 3578 samples for each of the digit classes (0–9). A subset of this database has also been used in the ICDAR 2013 digit recognition competition [115].

Firemaker Database

The Firemaker database [116] comprises handwritten samples of 250 Dutch individuals with 4 samples per writer making a total of 1000 writing samples. Each writer copied a given text in the normal writing style on page 1 while on page 2 the writers copied the given text using only uppercase letters. Page 3 of each writer comprised ‘forged’ text whereas on page 4 the writers provided their own text describing the contents of a given cartoon. All pages were scanned at 300 dpi as gray-scale images. The database has been mainly employed for evaluation of writer identification and verification systems [37, 117].

Databases in the Arabic and Arabic-like scripts

Arabic is the second most widely used script after Roman and supports languages like Arabic, Urdu, Pashto and Farsi. The initial research in document analysis and recognition mainly focused on text in Roman scripts only and it was relatively late that Arabic and other Arabic-like scripts started to receive notable research interest. During the last decade, however, significant research has been carried out on Arabic handwriting recognition and other related tasks. Consequently, a significant number of Arabic and similar databases have been developed in the recent years. We present an overview of well-known handwritten Arabic and Arabic-like databases in the following sections.

IFN/ENIT

The IFN/ENIT database [19] comprises handwritten Arabic words representing names of towns and villages in Tunisia along with the postal code of each. The database has been developed from contributions by 411 volunteers, each filling in a specified form. The database contains a total of 26,400 words (city/town names) corresponding to 210,000 characters. The ground truth data with the database includes information on the sequence of character shapes, baseline and the writer. All filled forms were digitized at 300 dpi and stored as binary images. The database mainly targets preprocessing [118–120] and recognition of Arabic handwritten words [121–126] but has also been employed to evaluate writer identification systems [127–129]. Figure 5 illustrates a town name from the database written by 12 different writers.

Fig. 5. Examples from the IFN/ENIT-database: a town/village name written by 12 different writers [19]

The Arabic Database: ADAB

The online Arabic database ADAB was jointly developed by the Institut Fuer Nachrichtentechnik (IFN) and the Research Group on Intelligent Machines (REGIM) to support research in online Arabic handwriting recognition. Writing samples were collected from 170 writers making a total of more than 20,000 Arabic words. The database is also accompanied by a tool which allows not only online data collection but also data verification and correction of erroneous data. The database has been employed in a number of segmentation and recognition studies [130–133] as well as for online writer identification [133, 134].

Arabic Handwriting Database: AHDB

The AHDB [20, 135] is an offline database of Arabic handwriting together with several pre-processing procedures. It contains Arabic handwritten paragraphs, words and the words used to represent numbers on checks produced by 100 different writers. The database was mainly intended to support automatic processing of bank checks, but it also contains pages of unconstrained text (as indicated in Fig. 6) allowing evaluation of generic Arabic handwriting recognition systems as well. The database can be employed in handwriting recognition [136] and writer identification tasks [137].

Fig. 6. An example of free handwriting in Arabic from AHDB database

Arabic checks database

This database has been developed to advance research in automatic recognition and processing of Arabic checks [13]. The database comprises a collection of 7000 images of checks containing about 30,000 sub-words and more than 15,000 digits. A sample check from the database is shown in Fig. 7. The database can be employed for evaluation of automatic check processing and recognition systems [138].

Fig. 7. A sample check from the database of Arabic checks [13]

The ARABASE

ARABASE [139] is a rich database for online as well as offline handwriting recognition. The database also supports recognition of offline machine printed Arabic text. The database includes complete paragraphs, words, isolated characters, digits and signatures. The database is also accompanied with a tool supporting traditional document analysis tasks on the database. The database can be employed for evaluation of (online/offline) handwriting recognition and signature verification systems.

CENPARMI Arabic handwriting database

The CENPARMI Arabic database [11] for offline Arabic handwriting recognition comprises isolated digits, letters, numerical strings and words. To support data acquisition, a two-page form was designed that was filled by 100 participants from Canada and 228 participants from Saudi Arabia. These forms comprised a sample Arabic date, 2 samples each of 20 digits, 38 numerical strings, 35 isolated letters and 70 Arabic words. The database is split into three sets. The first set comprises the forms of first 100 writers while the second set contains the forms filled by 228 writers. The third set is a combination of samples from set 1 and set 2. The database has been used for recognition of Arabic characters [140] and numerals [141] as well as word spotting [142].

IBN-E-SINA database

The IBN-E-SINA database [143] is developed on a manuscript provided by the Institute of Islamic Studies (IIS), McGill University, Montreal. The database is a part of the RaSI project which is aimed at creating a large-scale database of Islamic philosophical and scientific manuscripts, mostly written in Arabic with some contributions in Persian and Turkish. The document images were obtained using camera imaging (21 mega-pixels) at a resolution of 300 dpi. The selected dataset consists of 51 folios which correspond to 20,722 connected components (almost 500 CCs on each folio). The database has been used in a variety of interesting research problems on historical manuscripts [144, 145].

Al-Isra Arabic Database

The Al-Isra Database [21] is a large collection of handwritten samples containing words, digits, signatures and sentences compiled by researchers at the University of British Columbia. The samples were gathered from 500 students at Al-Isra University, Jordan. Each student produced a preselected list of words, digits and phrases. The database comprises 500 unconstrained Arabic sentences, 37,000 words, 10,000 digits and 2500 signatures. The database can be employed for handwriting recognition and writer identification tasks.

LMCA database

The On/Off LMCA (“Lettres, Mots et Chiffres Arabe” in French) is a dual Arabic database comprising characters, words and digits [22]. The database includes samples of 55 participants making a total of 500 words and 30,000 digits. The database is compiled in the UNIPEN format [15], the same as that of IRONOFF [99] database. The database can be used for online as well as offline recognition of Arabic words and digits.

KHATT database

KHATT [146, 147] is a comprehensive database of Arabic handwritten text comprising 1000 forms produced by the same number of writers from different countries. Each form is scanned at three different resolutions: 200, 300 and 600 dpi. The textual content of the database comprises 2000 paragraphs randomly picked from multiple sources. The ground truth of the database is provided in XML format and includes the transcription of text at line and paragraph levels. The information about the writer of each sample is also stored. The database is also accompanied by tools that allow segmentation of text images into lines and paragraphs. In addition to recognition of handwriting, the database supports evaluation of a number of pre-processing and segmentation tasks as well as writer identification systems.

QUWI database

The Qatar University Writer Identification (QUWI) [148] database is a comprehensive collection of writing samples of 1017 writers of different cultural and educational backgrounds. A unique feature of this database is that it is a bi-script database where each author contributed four pages, two in English and two in Arabic. This allows the database to be used in a number of interesting writer identification scenarios. Another feature of this database is that page 1 and page 3 of each writer contain an arbitrary text from the writer’s own imagination in Arabic and English, respectively, while page 2 and page 4 of each writer comprise a fixed predefined text (in Arabic and English, respectively). This allows the database to be used in text-independent as well as text-dependent evaluation scenarios. The database was mainly developed to support evaluation of writer identification [149] and writer demographic classification systems [150–152] but can also be used for handwriting recognition and similar related tasks.
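The page layout described above lends itself to a simple lookup when assembling evaluation scenarios. The following sketch encodes the four pages per writer and selects the pages relevant to a given script or text type. The names `QUWI_PAGES` and `pages_for` are hypothetical helpers introduced for illustration; they are not part of any tool distributed with the database.

```python
from typing import List, Optional

NUM_WRITERS = 1017  # number of writers stated in the text

# Page number -> (script, text type), following the description above.
QUWI_PAGES = {
    1: ("Arabic", "free"),
    2: ("Arabic", "fixed"),
    3: ("English", "free"),
    4: ("English", "fixed"),
}


def pages_for(script: Optional[str] = None,
              text_type: Optional[str] = None) -> List[int]:
    """Page numbers matching the requested script and/or text type."""
    return [
        page
        for page, (s, t) in sorted(QUWI_PAGES.items())
        if (script is None or s == script)
        and (text_type is None or t == text_type)
    ]


total_pages = NUM_WRITERS * len(QUWI_PAGES)
text_dependent_pages = pages_for(text_type="fixed")
text_independent_pages = pages_for(text_type="free")
arabic_pages = pages_for(script="Arabic")
```

For example, a text-dependent experiment would compare only the fixed-text pages across writers, whereas a cross-script experiment could train on the Arabic pages of a writer and test on the English pages of the same writer.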

AHTID/MW

The Arabic Handwritten Text Images Database by Multiple Writers (AHTID/MW) [153] has been developed to support research in the Arabic handwriting segmentation and recognition. In addition, the database can also be employed to evaluate the writer identification systems. The database comprises 3710 text lines and 22,896 words contributed by 53 different native writers of Arabic and is supported by the ground truth annotations. The database has been employed for evaluation of segmentation [154] and writer identification tasks [155].

IAUT/PHCN database

The IAUT/PHCN Database [24] is a collection of handwritten words representing Persian city names. The database was compiled using 1140 forms filled in by 380 individuals. The database comprises a total of 200,000 characters and the ground truth includes the Unicode code points of the characters in the city name and baseline information. All forms were scanned at 300 dpi and stored as binary images. The database has been mainly designed to support Farsi word recognition and preprocessing tasks [156–158].

IFN Farsi database

Inspired by the IFN/ENIT Arabic database [19], the IFN Farsi database [23] comprises more than 7000 images for about 1080 Iranian city/province names. A total of 600 individuals contributed to data collection where each writer filled in a maximum of two forms with 24 city/province names and their respective postcodes. The ground truth data, in addition to the transcription of text, comprises information on the sequence and number of characters, dots and partial words. The database can be used to evaluate Farsi handwritten word and digit recognition systems.

CENPARMI Farsi database

The Center for Pattern Recognition and Machine Intelligence (CENPARMI) Farsi database [10] has been developed to support research in handwriting recognition and word spotting on Farsi text. The database is compiled from 400 native Farsi writers and comprises 432,357 images with dates, words, isolated letters, digits and numeral strings. Each image is provided in gray scale as well as binarized form. The database has been employed in evaluation of symbol/digit recognition [159] as well as Farsi handwriting recognition [160, 161].

FHT: Farsi handwritten text database

FHT database [162] is a repository of unconstrained handwritten texts produced by 250 participants who filled 1000 forms containing Farsi text. The database includes a total of 106,600 handwritten Farsi words, 230,175 subwords and 8050 sentences. Due to its diverse nature, FHT database can be used to evaluate a wide variety of systems including recognition of words and subwords, segmentation of words into characters, baseline detection, machine printed and handwritten textual content discrimination, writer identification and document layout analysis.

HaFT: Farsi text database

HaFT [163] is a large collection of unconstrained Farsi handwritten documents produced by 600 different writers. Each writer contributed three samples at different intervals of time and each sample comprises eight lines of text. This makes a total of 1800 handwritten text images. The database is mainly designed for training and evaluation of Farsi writer identification and writer verification systems but can also be used for different recognition and segmentation tasks.

CENPARMI Urdu database

The CENPARMI Urdu handwritten database [164] comprises Urdu words, characters, digits and numeral strings. A number of native Urdu speakers from different parts of the world contributed to the data collection process. The lexicon of 57 Urdu words and 44 Urdu characters mainly comprises financial terms to support recognition of offline Urdu words, characters and digits. This is the first published database on Urdu handwriting and has been employed in recognition and spotting of Urdu handwritten words [165, 166].

Urdu handwritten sentence database

A relatively new database of unconstrained Urdu handwritten text, along with a few pre-processing and segmentation algorithms, is presented in [167]. The database comprises 400 forms filled by 200 different writers by copying the text given on each form. The forms were generated by taking text from six different categories of news with each category having up to 70 forms. The ground truth of the database includes transcription of text, information on lines and the identity of the writer. The database can be employed for recognition of Urdu text, line segmentation and writer identification.

CJK databases

Chinese, Japanese and Korean (CJK) are the main East Asian languages. The writing systems of these languages partially or completely use Chinese characters, known as Hanzi, Kanji and Hanja, respectively. To facilitate research in different areas of handwriting recognition in these languages, a number of standard databases have been developed and distributed. We discuss the notable databases in the following sections.

PE92: handwritten Korean character image database

PE92 [17] is a very large and unique database comprising 100 handwritten image sets of 2350 Hangeul characters (Fig. 8). More than 500 writers contributed to the generation of the first 70 sets while the last 30 sets were produced by one person. Writers filled pre-defined forms by writing characters in specified boxes. The database has been used in a variety of recognition tasks [168–170].

Fig. 8 Handwritten samples from the PE92 Korean character image database [17]

Online Japanese character pattern database

A database of online Japanese character patterns [31, 171] was compiled to support research in Japanese character recognition systems. These characters were extracted from unconstrained textual phrases provided by 80 writers. The text was collected from Japanese newspapers and produced 1227 frequently occurring Japanese character categories. The patterns were manually inspected and corrected to remove errors and wrongly written characters. The database has been used in a number of online character recognition systems [172–175].

HCL-2000 Database

HCL-2000 [176, 177] is a large collection of frequently used Chinese characters produced by 1000 writers. In addition to the ground truth information of 3755 characters, information about the writers, their age and gender is also stored allowing evaluation of writer identification or demographic classification systems. The database has been employed in a number of Chinese character recognition systems [178–180].

SCUT-COUCH2009: online unconstrained Chinese handwriting database

The SCUT online handwriting Chinese character recognition database, SCUT-COUCH2009 [181], is a revised and enhanced version of the SCUT-COUCH2008 [182] database. The database contains 11 datasets of diverse kinds of vocabularies and has been mainly developed to facilitate research in unconstrained online Chinese handwriting recognition. The database comprises individual Chinese characters in different standards, complete Chinese words and isolated symbols. The total character count in the database is more than 3.6 million. A sample image from the database is shown in Fig. 9. All the samples were gathered using PDAs (Personal Digital Assistants) and smart phone devices with touch screens and a total of 190 different individuals contributed to data collection. This database was the first publicly available large online Chinese handwriting database and has been employed in a number of online handwriting recognition tasks [183–186].

Fig. 9 Handwritten samples from the SCUT-COUCH 2009 Database [181]

CASIA: online and offline Chinese handwriting databases

CASIA [187, 188] is a widely used Chinese handwritten database comprising handwritten paragraphs as well as isolated characters. The data was collected from 1020 individuals who produced writings on paper with a digital pen. This allowed capturing the online trajectory information as well as the offline images of text. The database was divided into six subsets, three comprising isolated characters (DB 1.0–1.2) and three having handwritten paragraphs (DB 2.0–2.2). The datasets of isolated characters comprise a total of about 3.9 million Chinese characters while the datasets of text (paragraphs) contain about 1.35 million characters. This database has been employed in a number of recognition [189–192] and word spotting tasks [193].

Touching character database

In order to assess character segmentation algorithms, a database of touching Chinese characters was compiled from the CASIA handwriting database [187, 188]. This database was termed as CASIA-HWDB-T [194]. The database includes more than 56,000 strings with two or more touching characters. More than 1800 strings comprise multiple touching characters. The database is also divided into interesting subsets like strings comprising all Chinese characters, mixed strings and digits. The ground truth data includes information on character classes and locations of touching points. This database can be used for character segmentation [195, 196] and recognition of broken or touching characters.

Databases in Indian scripts

Significant research has been carried out on document analysis and recognition problems in different Indian scripts. Several hundred languages are spoken and written in India with Hindi, Tamil, Telugu, Bengali, Kannada and Gujarati being the popular ones. Some of the languages share common scripts while others have unique scripts of their own. Well-known Indian scripts include Devanagari, Telugu, Tamil and Kannada. These diverse scripts offer a variety of interesting and challenging problems to the document recognition community. Despite a rich diversity of scripts and languages, the number of standard databases on Indian scripts is relatively small. We discuss the databases developed on different Indian scripts in the following sections.

Handwritten numeral databases of Indian scripts

A large database of handwritten numerals in two popular Indian scripts is presented in [197]. The database was compiled by collecting numerals from postal mails and job application forms in Devanagari and Bangla scripts. A total of 22,556 Devanagari numerals were collected from 368 postal mails and 274 job application forms. In a similar fashion, 23,392 Bangla numerals were collected from 465 mails and 268 job applications. All images were digitized at 300 dpi and saved as gray scale ‘tif’ images. The database has been used to evaluate a number of digit recognition systems [25, 198–200].

Kannada handwritten document dataset

The Kannada Handwritten Text Database (KHTD) [201] comprises 204 writing samples in a popular Indian script Kannada. The database has been developed by collecting writing samples from 51 native speakers of Kannada, and the textual content comes from four different categories. The database has a total of more than 4000 lines of text and 26,000 words. The database can be employed in a number of segmentation and recognition tasks at line, word or character levels.

A database of Tamil handwritten city names

A database of handwritten city names in Tamil, a popular script in India and Sri Lanka, is presented in [202]. The database includes a total of 265 different city names with 109 cities from the Indian state of Tamil Nadu and 156 cities from Sri Lanka. Each city name has 100 instances in the database and a total of 500 writers with different educational backgrounds contributed to data collection. The database is also accompanied by algorithms to automatically segment city names from the image. Out of the 265 city names, 258 comprise only 1 word, 5 names include 2 words and 2 names contain 3 words with an average of 7 characters per city name. The database can be used for recognition of handwritten Tamil words.

Devanagari numeral and character database

A database comprising Devanagari numerals and characters is presented in [203]. Writing samples of 750 individuals belonging to different educational backgrounds, ages and professions were collected. The database comprises a total of 5137 isolated numerals and 20,305 isolated characters stored as binary ‘tif’ images. The database has been made publicly available and can be used for recognition of Devanagari characters.

Miscellaneous

Having discussed the handwritten databases in Roman, Arabic, CJK and Indian scripts, we now present a few other databases.

AMHCD: a database for Amazigh handwritten character recognition research

This database has been developed to support research activities on Amazigh text. Amazigh is spoken by millions of people in Africa mostly for oral communication. The Moroccan government took the initiative to promote Amazigh in mass media as well as the educational system. As a part of these efforts, the IRF-SIC Laboratory at the Ibn Zohr University, Morocco developed the AMHCD database [204] comprising a total of 25,740 isolated characters contributed by 60 different writers (Fig. 10). Each author produced 13 examples of each Amazigh character. The collected documents are scanned at 2400 dpi and are stored as colored ‘jpeg’ images. The database mainly targets the recognition system for handwritten Amazigh characters [205, 206].

Fig. 10 A filled form in the AMHCD database [204]

GRUHD: database of Greek unconstrained handwriting

The GRUHD [29] database is a huge collection of unconstrained Greek text. The database includes sentences, characters, digits and other symbols. The writings have been produced by 1000 writers with equal distribution of male and female writers. The database comprises 1760 forms having 667,583 symbols and 102,692 words. The database has been employed for character/symbol recognition [207–209] and discrimination of machine-printed and handwritten texts [210].

MRG-OHTC database

MRG-OHTC [211] is a collection of online Tibetan writings facilitating research in online Tibetan character recognition. A total of 130 Tibetan writers produced the database comprising 910 Tibetan characters from the basic and extended Tibetan character set. The writing samples are collected on a digital tablet using an electronic pen. The database has been employed for evaluation of Tibetan character recognition systems [212, 213].

Discussion

After having discussed the databases in different scripts, we now present a comparative overview of these databases in Table 1 along with a critical appreciation. The databases are ordered by year of publication and are compared on the basis of the following criteria.

  • Content of writing (sentences, words, characters or digits)

  • Handwriting mode (online or offline)

  • Language or script

  • Total number of writers

  • Total number of samples

  • Problems on which databases could be employed

Table 1 An overview of the databases discussed in the paper

It can be observed from Table 1 that the development of standard databases and their ground truth labeling has witnessed notable growth in the last few years. Attempts have been made to capture as much variation in writing as possible by considering a large number of writers in the data collection process. In terms of number of writers, the RIMES database [5, 63] seems to be the most comprehensive with around 1300 individuals contributing their writing samples. From the viewpoint of number of writing samples, RIMES comprises more than 12,000 pages of handwritten text, one of the largest collections of unconstrained handwritten images. This database, however, is not publicly available. In terms of usage, the IAM handwriting database [2, 3] is one of the most widely used databases for a number of recognition tasks. The only major issue with the IAM database is the non-uniform distribution of samples per writer, which varies from more than 50 for 1 writer to 1 for about 350 writers. This complicates the evaluation protocols for writer identification and verification systems, where a varied amount of text per writer is available to train and test the systems. Nevertheless, the IAM database remains one of the most popular databases employed by the handwriting recognition community. Likewise, for research on Arabic handwriting, the IFN/ENIT database has been most extensively employed for Arabic handwriting recognition and Arabic writer identification.

Naturally, most of the databases discussed in our study are based on English or Arabic writing samples. This is due to the significant research attention these languages have received over the last three decades. During the last few years, however, research on text in other languages has also gained interest resulting in the development of handwritten databases in many languages like Farsi and Urdu. A trend of having multi-script databases can also be witnessed in the recently developed databases. Such collections provide an opportunity to study the interesting scenarios of finding common writing patterns of individuals across different scripts. QUWI database [148] is an example of such a multi-script database comprising writing samples in Arabic and English. Another interesting aspect in recent databases is that instead of simply keeping the identity of the writer, additional information including the age, gender and background of the writer is also stored allowing development and evaluation of automatic user demographics classification systems, a relatively less explored area in handwriting analysis.

From the view point of textual content, the preliminary databases in all the scripts mostly comprised isolated characters, digits or words. These databases were mainly employed to evaluate the initial research endeavors in recognition of characters, digits and words. With the advancement in computerized recognition of handwriting, databases comprising unconstrained text (paragraphs) in natural writing styles of contributors were developed. These databases allowed evaluation of unconstrained handwriting recognition rather than simply character or word recognition.

An important parameter in the analysis of different databases is how well they represent real-world scenarios. Databases where the acquisition is unconstrained and gives writers the flexibility to write in their natural styles are closer to the writing samples encountered in real-world problems. For applications like handwriting recognition, significant training data is available, but for problems like forensic document analysis (writer identification, writer verification, etc.), the amount of text available to learn the characteristics of an individual is, in general, limited. The same is the case in the test phase where only limited text may be available to find the identity of an individual from a given writing sample. Systems developed for such applications should therefore be evaluated in experimental setups which match the real-world constraints. There is also a need to consolidate the large number of databases at a common platform allowing researchers in document analysis and recognition to choose the most appropriate database(s) for development and evaluation of their systems.

As discussed earlier, the problems that could be evaluated using a given database are a function of the ground truth information provided with the database. For all recognition tasks, the database must be accompanied by the corresponding transcription (character, word or paragraph level). Likewise, systems dealing with identification or verification of writers and prediction of user demographics from handwriting require writer information to be stored along with each writing sample. Table 2 groups the databases discussed in this paper as a function of tasks in which they can be employed. As expected, most of the handwriting databases have been developed for evaluation of offline handwriting recognition systems. A few recent databases support evaluation of online recognition systems as well. The least explored area seems to be user demographics classification from handwriting and only a few databases contain the required ground truth (writer) information to evaluate such systems.

Table 2 Usage of databases
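The dependence of supported tasks on ground truth, as described above, can be illustrated with a minimal Python sketch. The field names (`transcription`, `writer_id`, `demographics`) and the task list are hypothetical labels chosen for this illustration; they are not taken from any of the surveyed databases, which document their ground truth in their own formats.

```python
# Hypothetical mapping from a task to the ground-truth fields it needs.
# The field and task names are assumptions made for illustration only.
TASK_REQUIREMENTS = {
    "handwriting recognition": {"transcription"},
    "writer identification": {"writer_id"},
    "writer verification": {"writer_id"},
    "demographics classification": {"demographics"},
}


def supported_tasks(ground_truth_fields):
    """Return the tasks whose required fields are all available."""
    available = set(ground_truth_fields)
    return sorted(
        task
        for task, needed in TASK_REQUIREMENTS.items()
        if needed <= available  # subset test: every needed field is present
    )


if __name__ == "__main__":
    # A database that stores only transcriptions supports recognition alone.
    print(supported_tasks(["transcription"]))
    # Adding writer identity enables identification and verification too.
    print(supported_tasks(["transcription", "writer_id"]))
```

Under these assumptions, a database storing only transcriptions would be listed for recognition alone, while one that also stores writer identity and demographics would be listed under every task.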

Campaigns, projects, competitions and results

During the recent years, the development of standardized datasets and their labeling has moved a step further to the organization of different evaluation campaigns and competitions. These competitions, related to different classical tasks of document analysis and recognition, not only allow a meaningful comparison of different algorithms under the same experimental conditions but also provide a platform for exchange of ideas and knowledge. This section is dedicated to the discussion of these campaigns and contests, but prior to that we present the UNIPEN project, a major milestone in online handwriting recognition.

UNIPEN project for online data exchange

The UNIPEN project [15] of data exchange was initiated by the International Association for Pattern Recognition (IAPR) Technical Committee 11 (TC-11) in 1992 with the objective of proposing a uniform format for representation and exchange of online data. The format was developed in collaboration with a group of 14 experts in online handwriting recognition. The participants of the project were asked to submit a minimum of 12,000 characters in any form (sentences, words or individual characters) and the approved data from the National Institute of Standards and Technology (NIST) was made publicly available. Presently, 11 datasets comprising characters, words and sentences have been compiled and software toolkits to manipulate the UNIPEN files are also provided with the database.

RIMES evaluation campaign

The RIMES project [5], funded by the French ministries of defense and research, was initiated to develop and evaluate automatic systems for indexing and recognition of handwritten letters. The project aimed not only at creating a large annotated database but also at organizing a set of evaluation campaigns covering a variety of document recognition tasks which could eventually fit in different industrial applications. The first phase of the evaluation campaign [63] comprised tasks including document (letters and fax) layout analysis, handwriting recognition (isolated characters, words and blocks of text), writer identification (on words and paragraphs), writer verification, logo recognition and identification of scenario from letters. The second phase of the campaign [214] focused on three themes (document layout analysis, handwriting recognition and writer identification) with a total of seven tasks. Five French research labs participated in this second phase of evaluations. After two successful phases of evaluations, the database was employed in a number of international competitions, discussed later in this paper.

Organization of competitions

The last few years have seen an increasing trend in the organization of international competitions on different tasks in document analysis and recognition. These contests are mainly advertised and organized in conjunction with the reputed document recognition conferences, the International Conference on Document Analysis and Recognition (ICDAR) and the International Conference on Frontiers in Handwriting Recognition (ICFHR) being the two most notable platforms. These contests provide training and validation datasets to the participants and require them to submit either the executables of their developed algorithms or the results on the unlabeled test datasets. A major proportion of these competitions are based on handwriting recognition and other related tasks. In most cases, the evaluation is carried out on published and well-known handwritten databases. In Table 3, we present a summary of the competitions based on the databases discussed in Section 3. It can be seen that IFN/ENIT is easily the most widely used database in the regularly organized Arabic handwriting recognition competitions. Recognition of online handwriting in different languages has also received increased research attention. Other than the traditional recognition tasks, competitions on prediction of gender from handwriting have recently gained significant interest. Although a very large number of groups participated in these competitions, relatively few groups actually revealed their identities and provided a description of their algorithms [149, 150]. In addition to the contests mentioned in Table 3, a number of other competitions on handwritten databases have also been organized but since they employ non-published or private databases, they are beyond the scope of our discussion.

Table 3 An overview of databases used in different competitions

Experimental protocols, evaluation metrics and state-of-the-art results

In this section, we discuss the experimental settings and evaluation metrics that are employed by researchers to solve the problems based on analysis of handwriting. As discussed earlier, the most important of these tasks is handwriting recognition, which is carried out at character, word and line levels. Consequently, these systems report results in terms of character and word recognition rates. In some cases, the edit distance between the recognized text and the ground truth text is used to quantify the recognition performance. Likewise, handwritten keyword spotting systems are evaluated using the standard precision and recall measures. The two measures are generally combined into a single F-measure to represent the performance by a single number.
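As a minimal sketch of the metrics mentioned above, the following Python code computes the Levenshtein edit distance between a reference and a recognized sequence, an error rate normalized by the reference length, and precision, recall and the balanced F-measure from retrieval counts. The function names are chosen for this illustration, and normalizing by the reference length is one common convention assumed here; individual studies and competitions may define their measures differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance divided by the reference length (assumed convention)."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def precision_recall_f(true_positives, false_positives, false_negatives):
    """Precision, recall and their harmonic mean (balanced F-measure)."""
    retrieved = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Character-level comparison of two strings.
    print(edit_distance("kitten", "sitting"))
    # Word-level error rate: pass lists of words instead of strings.
    print(error_rate("the quick brown fox".split(), "the quick brown box".split()))
    # Keyword spotting with 8 correct hits, 2 false alarms and 2 misses.
    print(precision_recall_f(8, 2, 2))
```

Passing strings gives a character-level measure, while passing lists of words gives a word-level one, so the same routine covers both levels at which recognition results are commonly reported.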

For writer identification systems, the performance is evaluated either using a leave-one-out approach or by splitting the database into training and test sets, the latter being more commonly employed. In most cases, in addition to the identification rate, the Top-K identification rates are also reported where, for a given query document, a list of the K most similar writers is retrieved, which increases the chances of finding the true writer of the query document. Similarly, the performance of writer verification systems is represented through receiver operating characteristic (ROC) curves and is quantified through the area under the curve (AUC) or the equal error rate (EER). The closely related task of gender (and user demographics) prediction from handwriting is evaluated using the classification rate.
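The two measures described above can be sketched in Python as follows. The sketch assumes that each query comes with a list of candidate writers ranked from most to least similar, and that verification scores are dissimilarities (lower means more similar); the function names and the simple threshold sweep used to approximate the EER are illustrative choices, not the procedure of any particular study.

```python
def top_k_rate(ranked_writers, true_writers, k):
    """Fraction of queries whose true writer is among the first k candidates."""
    if not true_writers or len(ranked_writers) != len(true_writers):
        raise ValueError("need one non-empty ranking per query")
    hits = sum(1 for ranking, truth in zip(ranked_writers, true_writers)
               if truth in ranking[:k])
    return hits / len(true_writers)


def equal_error_rate(genuine, impostor):
    """Approximate EER from dissimilarity scores (lower = more similar).

    Sweeps a threshold over the observed scores and returns the mean of the
    false acceptance and false rejection rates where they are closest.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s > t for s in genuine) / len(genuine)     # genuine pairs rejected
        far = sum(s <= t for s in impostor) / len(impostor)  # impostor pairs accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    rankings = [["w1", "w2", "w3"], ["w3", "w1", "w2"], ["w2", "w3", "w1"]]
    truth = ["w1", "w1", "w1"]
    for k in (1, 2, 3):
        print(k, top_k_rate(rankings, truth, k))
    print(equal_error_rate([0.1, 0.2, 0.3, 0.6], [0.4, 0.5, 0.7, 0.8]))
```

The Top-K rate can only grow with K, which is why it is reported alongside the Top-1 identification rate rather than in place of it.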

To provide an idea of the performance of state-of-the-art systems on different handwriting recognition tasks, we present a summary of some of the best results reported in the literature on commonly used handwriting databases in Table 4. A few of these results have been taken from the findings of different international competitions while others have been compiled from the literature as reported by the respective researchers (to the best of the authors’ knowledge). For handwriting recognition, a high word recognition rate of 94.85 % [69] is reported on the RIMES database. The recognition rates on the IAM database vary as different studies employ different evaluation protocols and an objective comparison is hard to make. The standard protocol for IAM lines comprises 6161 lines (45,000+ words) for training, 920 lines (7000+ words) for validation and 2781 lines (around 20,000 words) for testing. Word recognition rates in the range 80–90 % are reported by a number of studies [215, 216]. These recognition rates, in general, are lower than those reported on the RIMES database. It should however be noted that the RIMES test set comprises 1600 unique words while the complete IAM database comprises a vocabulary of more than 10,000 words, a major reason for relatively lower recognition rates. Regarding Arabic handwriting recognition, a high word recognition rate of 93.37 % [217] is reported on the IFN/ENIT database.

Table 4 Overview of state-of-the-art results on commonly used databases

Writer identification systems have been most frequently evaluated and compared on the IAM database and the highest identification rate of 96.7 % is reported in [218] with one sample of each of the 657 writers in training and one in the test set. Like Arabic handwriting recognition, the writer identification systems targeting Arabic handwriting mostly employ the IFN/ENIT database. The system presented in [219] realizes the highest identification rate of 90 % on the 411 writers of this database. Writer identification rates on the recently developed KHATT database are relatively lower (73.4 %) mainly due to the large number of writers (1000) in the database. The QUWI database, which includes writer demographics information, has been employed for gender classification in a number of recent studies and the highest classification rate realized is 69.25 % [150]. Although a two-class problem, gender prediction from handwriting is a challenging task as the correlation between handwriting and gender is not known to be very strong, a major reason for low classification rates. A step further in evaluation of writer identification and gender classification systems is the multi-script experimental setup where training and test samples come from different scripts. Naturally, the recognition rates (55 % on writer identification and 65 % on gender classification [220]) on these challenging problems are not as high as in the case of a single script. Robust systems which exploit the common features of writers across different scripts need to be investigated to enhance the current state-of-the-art on these tasks.

Conclusions

Handwriting recognition and related areas pose challenging research problems. The field has seen more than 30 years of intensive research, and state-of-the-art solutions have been developed for many problems. A number of handwriting recognition problems still remain open for the document recognition community and significant research targeting different aspects of handwriting recognition is being carried out presently. In recent years, there has been an increasing trend of developing standard databases, compiling the ground truth data to support different recognition tasks and making the databases available to the research community to evaluate their algorithms. In general, the statistics and ground truth information of each database are detailed in its respective publication.

This paper is an endeavor to provide a comprehensive survey of notable databases of handwritten text developed over the last two decades. For each database, we provided details on its structure, statistics, ground truth information and the tasks supported. Typically, these databases target one or more of the preprocessing, segmentation and recognition tasks. The type of task(s) that can be evaluated with a given database is a function of the ground truth data accompanying the database. In addition to the location and transcription of text, information about contributors is also stored in some cases allowing evaluation of writer recognition and writer demographics classification tasks as well.

We also discussed the evaluation campaigns and competitions organized using these databases. Organization of competitions in conjunction with reputed document and handwriting recognition conferences has become a regular activity for the last few years. The increasing number of participants in these competitions is a clear indication of the kind of research attention different problems of handwritten documents are attracting. In addition to the description of databases, we also summarized the state-of-the-art results on commonly used databases for a number of recognition tasks.

This contribution is likely to provide a summarized review of different databases, allowing researchers to choose the most appropriate datasets for evaluation of their proposed systems.