Introduction

Writing has been a significant milestone for humanity. It has allowed us to record and preserve knowledge, which can be passed on from one generation to the next. In modern times, writing has become indispensable, particularly in the industrial and commercial sectors, where it facilitates business transactions through administrative, legal, and financial documents [1].

In the digital age and with process automation, handwriting has lost ground to online platforms. However, there are still documents that require validation through handwriting, such as forms [2], medical prescriptions [3], and bank checks [4]. As a consequence, modern systems must adapt and incorporate offline documents, which may include partially or entirely handwritten material. This is essential for migrating historical data to digital environments.

In this context, the research field of offline Handwritten Text Recognition (HTR) has gained prominence due to its objectives of identifying, recognizing, and transcribing cursive texts from images to digital media (ASCII, Unicode) [1, 5]. However, this is not a simple task, and the most difficult challenge in this research area is the complexity and variability of human handwriting. Unlike printed text, which has a homogeneous pattern, cursive writing varies widely, mainly due to differences in style between writers and even within the writing of a single writer in the same text. Consequently, handwriting recognition attracts development attention in both academic and industrial settings [1,2,3,4].

In recent decades, HTR systems have evolved significantly. The first approach with a great impact on handwriting recognition was based on Hidden Markov Models (HMMs) [6,7,8]. This approach later evolved into combining Long Short-Term Memory (LSTM) neural networks for feature extraction with HMMs for text decoding. This architecture also yielded good results with the use of Multidimensional LSTM (MDLSTM) layers [9, 10], and with Connectionist Temporal Classification (CTC) as the loss function of the optical model [11].

Although MDLSTM-based systems have shown promise, their high computational cost has encouraged alternatives such as Convolutional Neural Networks followed by Bidirectional LSTM layers (CNN-BLSTM) [12,13,14]. Currently, optical models aim for low computational cost alongside high recognition performance, such as architectures based solely on convolutional networks, compact architectures, or networks based on attention mechanisms [14,15,16,17]. It is noteworthy that even with the evolution of optical models, language models have been used to complement the text decoding step, which provides better results than using the optical model alone [18, 19].
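
To make the CNN-BLSTM family concrete, the following minimal PyTorch sketch shows how convolutional features can be sliced column-wise, fed to a bidirectional LSTM, and trained with CTC loss. All layer sizes, the 80-character charset, and the dummy batch are illustrative assumptions, not taken from any cited work.

```python
import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    """Minimal CNN-BLSTM optical model sketch; all sizes are illustrative."""
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # H/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # H/4
        )
        feat_h = img_height // 4
        self.blstm = nn.LSTM(64 * feat_h, 128, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, num_classes + 1)  # +1 for the CTC blank

    def forward(self, x):                      # x: (B, 1, H, W)
        f = self.cnn(x)                        # (B, C, H', W')
        f = f.permute(0, 3, 1, 2).flatten(2)   # one time step per image column
        out, _ = self.blstm(f)
        return self.fc(out).log_softmax(-1)    # (B, W', num_classes + 1)

# One CTC training step on dummy data
model = CNNBLSTM(num_classes=80)
images = torch.randn(4, 1, 32, 128)
log_probs = model(images).permute(1, 0, 2)     # CTCLoss expects (T, B, C)
targets = torch.randint(1, 81, (4, 10))        # label indices, 0 is blank
input_lens = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((4,), 10, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)
loss.backward()
```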

Despite achieving good results in the academic field, optical models are still unsatisfactory for many industrial use cases. One of the problems is the limited volume of labeled data available for training an optical model, a limitation that recurs with each new type of document to be transcribed. In other words, it is necessary to invest time and effort in labeling a minimally sufficient amount of data to train an optical model, which is often impossible due to the lack of data [4, 20,21,22,23].

To minimize the issue of data restriction, some data augmentation approaches are employed in the workflow of an optical model to create synthetic text images. The first and most common approach is to apply random transformations in image preprocessing; the second is to transfer knowledge from other datasets through transfer learning; and the third is to combine the preprocessing transformations with a model robust to low data volumes. However, these approaches still have limitations, which eventually lead to premature overfitting in the training of optical models. A sketch of the first approach is given below.
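
The function below applies a random geometric and morphological perturbation to a grayscale text image at preprocessing time; the OpenCV implementation and the parameter ranges are illustrative assumptions, not a method from any reviewed study.

```python
import cv2
import numpy as np

def random_augment(img: np.ndarray) -> np.ndarray:
    """Apply a random geometric/morphological transform to a grayscale
    text image (dark text on white). Parameter ranges are illustrative."""
    h, w = img.shape[:2]
    # Small random rotation and horizontal shear around the image center
    angle = np.random.uniform(-3, 3)
    shear = np.random.uniform(-0.2, 0.2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[0, 1] += shear  # inject shear into the affine matrix
    img = cv2.warpAffine(img, M, (w, h), borderValue=255)
    # Random stroke thickening/thinning via morphology
    kernel = np.ones((2, 2), np.uint8)
    if np.random.rand() < 0.5:
        img = cv2.erode(img, kernel)   # local minimum: thickens dark strokes
    else:
        img = cv2.dilate(img, kernel)  # local maximum: thins dark strokes
    return img
```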

Therefore, this work presents a systematic literature review on data augmentation applied to offline handwritten text recognition. The aim is to identify, explore, and analyze the existing approaches for generating synthetic images of handwritten text, which can enhance the training of optical models. Thus, we intend to map the progress achieved in the last decade and discuss possible trends based on the knowledge obtained from the state-of-the-art. This period was selected because it covers the most significant recent advances in machine learning relevant to computer vision.

The rest of this paper is structured as follows: The section “Protocol Mapping” describes the methodology applied for the systematic review. Next, in the section “Results”, the results of the review are presented. The section “Discussion” answers the research questions and discusses the state-of-the-art. Finally, the section “Conclusion” presents the conclusions of the current work.

Protocol Mapping

Following the guidelines and protocols proposed in the works of [24, 25], we developed our method to plan and execute our study. Our goal was to define a specific scope for collecting a set of scientific papers that would make the review useful, accessible, and reproducible for the academic community. To achieve this, we considered detailed documentation of the process.

Based on the established scope, the papers that fit into it were selected, analyzed, and scored according to the research questions defined by the authors before the review. In addition, we used a combination of Zotero (Footnote 1) with a custom spreadsheet for paper management, tracking, and analysis.

The first step in a systematic review is to define its objective, followed by research questions, and only then the scope. Next, a search strategy is constructed, which includes search keywords, selection and exclusion criteria, and lastly quality assessment.

Research Objective

Data augmentation approaches generate synthetic data, aiming to minimize the problem of sample limitation in the training of deep learning models. Thus, the main objective of the systematic review proposed in this work was to explore the literature and identify data augmentation approaches applied to offline handwritten text recognition.

It is important to note that handwriting recognition works often use some form of data augmentation; when augmentation is not the central focus, such works are not considered here. On the other hand, works that focus solely on data augmentation of cursive text images are considered, as their central theme aligns with the objective of the review, and they offer great potential for application in the research area of handwriting recognition in general.

It is also worth mentioning that we have restricted the scope of this work to the specific research field of offline handwritten text recognition. In other words, we do not consider other research areas of text recognition, such as online and printed, nor sub-areas such as scene text, digits, signatures, and mathematical expression recognition.

Research Questions and Strategy

Research Questions (RQs) are defined to guide the review, direct the reading, and discuss specific aspects of the papers. Thus, we defined our research questions, aiming to identify and discuss the state-of-the-art regarding data augmentation approaches applied to offline handwritten text recognition. Four questions were formulated for this purpose:

  • RQ1: What are the most commonly used recognition levels for data augmentation applied to offline handwritten text recognition?

  • RQ2: What are the most commonly used datasets for data augmentation applied to offline handwritten text recognition?

  • RQ3: What is the current state of the data augmentation research field applied to offline handwritten text recognition?

  • RQ4: What are the current challenges in data augmentation applied to offline handwritten text recognition?

The first question was defined to explore and understand the text structures most commonly used by data augmentation approaches. The second question aims to explore the most commonly used datasets and the languages with which they were built. The third question was defined to explore and understand the field of image data augmentation applied to offline handwritten text recognition systems. This involves understanding and analyzing different approaches within the research field to solve the same recognition problem. Finally, the fourth question was defined to discuss the gaps and trends in the research field of data augmentation of handwritten text images.

Based on our research objective and questions, we defined a set of keywords, period of time, and inclusion and exclusion criteria for our search strategy. Accordingly, we selected five academic databases to compose the review protocol: (i) ACM Digital Library (Footnote 2); (ii) IEEE Digital Library (Footnote 3); (iii) Science Direct (Footnote 4); (iv) Scopus (Footnote 5); and (v) Springer Link (Footnote 6).

We selected these databases due to their extensive coverage in the field of technology, comprising a vast collection of scientific papers widely recognized in the academic area. It is important to note that we selected only direct academic databases, which enable the reproducibility of the work through a search string. Additionally, the search was conducted using the advanced search mechanism of each platform mentioned, taking into consideration all metadata and full-text papers as search sources. In this way, the keywords were defined with the following objectives: (i) to be directed toward the research area of handwritten text recognition; (ii) to have a comprehensive search of data augmentation approaches; (iii) to capture the most relevant works. Moreover, we analyzed different terms and variations to ensure that the search string is precise and not generic, with the specific focus on the research objective. The keywords are described in Table 1.

Table 1 Keywords and synonyms used to generate the search string

Based on the defined keywords, the following search string has been determined: (“handwritten text recognition” OR “handwriting recognition” OR “htr”) AND (“data augmentation” OR “image augmentation” OR “generator”) AND (“synthetic” OR “synthesis” OR “synthesize”).

The research field of offline handwritten text recognition has evolved significantly through deep learning, which is a relatively recent area of study. Thus, to better understand the studies developed over time, we defined the period between 2012 and 2023, covering more than 10 years up to the start of this review. This time frame is adequate to capture recent changes and advancements, as well as trends of interest.

The Exclusion Criteria (EC) are characteristics that disqualify works from the review; disqualified works are not carried forward to the next step. The exclusion criteria are defined as follows:

  • EC1: Works that are not in the computer science subject area;

  • EC2: Works that are posters, tutorials, editorials, calls for papers, short papers, books, book chapters, or theses;

  • EC3: Ongoing works;

  • EC4: Works that are not in English;

  • EC5: Works that are not within the established period of time for the review;

  • EC6: Works that are literature reviews or surveys;

  • EC7: Duplicate works;

  • EC8: Works that are not available in full. A work is considered unavailable only after contacting the corresponding authors and receiving no response;

  • EC9: Works that are not within the scope of optical character recognition;

  • EC10: Works that do not present, or focus on, the offline handwritten text recognition problem;

  • EC11: Works that do not present, or focus on, a data augmentation approach applied to the offline handwritten text recognition problem;

  • EC12: Works that reached less than five points on the Quality Criteria.

It is worth mentioning that the five-point threshold was defined to cover works that present high quality in both the technical and descriptive aspects [26]. Finally, the Inclusion Criteria (IC) are characteristics that qualify the works for the next step. The inclusion criteria are defined as:

  • IC1: Works that present, or focus on, a data augmentation approach applied to the offline handwritten text recognition problem.

Research Steps and Information Extraction

With the research scope defined, we followed a four-step pipeline for study selection consisting of: (i) primary studies collection; (ii) preliminary selection; (iii) final selection; and (iv) quality assessment. For the first step, we applied the search string to the database sources. At this stage, some exclusion filters were applied directly through the search engine of each database, covering the period of time, subject area, document type, publication stage, and language. Since some platforms offer more filters than others, the filters missing on a given platform were applied manually in the next step.

The titles and abstracts of the primary papers were briefly read. This first reading was performed by pairs of reviewers, where each reviewer included or excluded each paper, assigning at least one inclusion or exclusion criterion. Any paper accepted by at least one reviewer advanced to the next step.

The papers selected in the previous step were fully read during the third step to look for false positives. Each paper was read by both reviewers and was again evaluated for inclusion or exclusion.

During the quality assessment step, the included papers were scored by the reviewers using the Quality Criteria (QC) defined below:

  • QC1: Is there a detailed description of the motivations, objectives, and contributions of the research? (weight 0.5)

  • QC2: Is there a detailed description of the dataset used? (weight 0.5)

  • QC3: Are the datasets used publicly available? (weight 1.0)

  • QC4: Is there a detailed description of the optical models used? (weight 0.5)

  • QC5: Is there a detailed description of the data augmentation approach? (weight 1.5)

  • QC6: Is the proposed approach applied in different recognition levels, such as words, lines, and paragraphs? (weight 1.5)

  • QC7: Is the proposed approach applied in different languages, such as English, French, and Spanish? (weight 1.5)

  • QC8: Is there a detailed description of the results achieved? (weight 1.0)

  • QC9: Do the results contribute to the handwritten text recognition research area? (weight 1.0)

  • QC10: Is the source code of the approach publicly available? (weight 1.0)

For each question, the following scale was used: No (N) = 0.0 points; Partially (P) = 0.5 points; and Yes (Y) = 1.0 point. Additionally, each question has its corresponding weight, which contributes to the weighted score at the end of the assessment. Finally, we applied EC12 (see the section “Research Questions and Strategy”) based on the final score of each paper. A minimal sketch of this scoring follows.
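
Assuming the final score is the sum of each answer value multiplied by its weight (the ten weights above total 10.0, consistent with the five-point threshold of EC12), the scoring can be sketched as follows; the example answers are hypothetical.

```python
# Weights of QC1..QC10 as defined above (they sum to 10.0)
WEIGHTS = [0.5, 0.5, 1.0, 0.5, 1.5, 1.5, 1.5, 1.0, 1.0, 1.0]
SCALE = {"N": 0.0, "P": 0.5, "Y": 1.0}

def quality_score(answers: list[str]) -> float:
    """Weighted score of one paper given its ten QC answers (N/P/Y)."""
    assert len(answers) == len(WEIGHTS)
    return sum(SCALE[a] * w for a, w in zip(answers, WEIGHTS))

# Hypothetical paper: mostly "Yes", partial dataset description, no code
paper = ["Y", "P", "Y", "Y", "Y", "N", "P", "Y", "Y", "N"]
score = quality_score(paper)   # 6.5
accepted = score >= 5.0        # EC12: exclude works below five points
```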

To facilitate the discussion of the papers, the adopted strategy for information extraction was to collect the following data in each study:

  1. Search engine/base;

  2. Publication year;

  3. Authors’ names;

  4. Paper title;

  5. Datasets;

  6. Dataset languages;

  7. Recognition level;

  8. Model type;

  9. Results.

Results

The initial corpus of the research comprised 976 primary papers, of which 341 were obtained from the IEEE Digital Library, 307 from Scopus, 194 from Springer Link, 80 from Science Direct, and 54 from the ACM Digital Library. Figure 1 shows the distribution of the primary studies among the databases in the first step of selection.

Fig. 1: Distribution of primary studies among academic databases (last access in September 2023)

It is important to emphasize that the selection in this first step used only the filters provided by the databases themselves (see the section “Research Steps and Information Extraction”). The remaining filter criteria were applied manually in the next selection step.

Following the defined protocol, the papers were given to the reviewers during the second step, the preliminary selection. The reading covered titles and abstracts, seeking any information that could fit the paper into one of the 12 exclusion criteria. Out of the 976 primary studies, only 125 were selected through this first reading. It is worth noting that several papers were excluded due to filter criteria unavailable on the research platforms themselves, as well as due to the duplication of studies among databases.

In the third step, the remaining papers were read in full. This second reading aimed to identify possible false positives that had not been noticed before. This step demanded more effort and judgment from the reviewers. Finally, the results of the evaluations served as the basis for selecting the works for the next step.

The full reading showed that many of the papers previously believed to be within the scope of the review were not related to the central theme. For example, works on handwritten text recognition with no focus on data augmentation, as well as works on scene text recognition and handwritten digit string recognition, were the most common. Among the 125 papers previously selected, only 50 were accepted for quality assessment.

The last step of the review process, quality assessment, was conducted once again in pairs. The reviewers made notes on the selected papers based on the quality criteria described in the section “Research Steps and Information Extraction”. In this step, EC12 was applied based on the final score of each paper, and only 32 papers reached the minimum required score.

The results of applying the ECs are reported next. Overall, out of the 976 studies obtained from the research platforms, the most common exclusion criterion was EC9, which corresponds to works not related to the scope of optical character recognition, accounting for 351 exclusions (40.5%). The second most common criterion was EC10, with 193 exclusions (22.3%), corresponding to studies in different application contexts, such as online, printed, and scene text recognition. The third most common criterion was EC11, with 148 exclusions (17.1%), indicating studies of offline handwritten text recognition without data augmentation as the central theme. More data on the exclusion criteria are shown in Fig. 2.

Fig. 2: Proportion of exclusion criteria (EC) applied

After completing the selection process, we observed the new distribution regarding the databases used: 12 from IEEE Digital Library, 8 from Scopus, 9 from Springer Link, 2 from ACM Digital Library, and 1 from Science Direct. Figure 3 shows the distribution of studies among the databases after the four steps.

Fig. 3: Distribution of studies selected among databases

Another important aspect to analyze is the distribution of the selected studies over time by year of publication. As shown in Fig. 4, we observe an upward trend in the research area of data augmentation applied to offline handwritten text recognition. The publication of studies that fit within the scope of this review began in 2016 with only 1 publication. Since then, the number of publications has been increasing, reaching 8 publications in 2022. It is worth noting that 2023 was not yet complete at the time of the review, providing only partial data; even so, it already shows a promising trend with 10 papers. In addition, no papers published before 2016 were selected.

Fig. 4: Distribution of selected studies over time based on year of publication. No papers were selected with a publication date prior to 2016

Additionally, Fig. 5 presents the overview of the systematic review conducted in this work.

Fig. 5: Overview of the systematic review conducted. (i) A search string was used in five academic databases to collect primary studies. (ii) A preliminary screening by reading titles and abstracts identified relevant studies. (iii) A full-text reading removed false positives in the final selection. (iv) A quality assessment selected studies that fit the defined scope

Finally, the quality assessment scores of each work are presented in Table 2 in descending order.

Table 2 Quality scores of approved works in descending order

The following subsections are structured to present different aspects of the selected papers. Among these aspects, we present the recognition tasks and the datasets used in each work. We also present the types of approaches and the contribution to the state-of-the-art of each work.

Recognition Tasks and Datasets

The reviewed works present a wide range of applications in the research field of offline handwritten text recognition. This corresponds to our expectations, as there are different recognition levels and different application contexts.

We consider recognition levels as different text structural components: (i) characters; (ii) words; (iii) lines; and (iv) paragraphs. It is important to note that the paragraph level automatically encompasses the other three structural components (line, word, and character); the line level also includes the word and character components; and the word level includes the character one. We take this hierarchy into account because each text structure adds its own degree of complexity to the recognition process.

For the application contexts, we consider the characteristics and challenges that the datasets offer. Sets of images involving multiple writers tend to be more challenging than sets with only one writer, due to the high variability in writing patterns. Likewise, images of historical documents tend to be more challenging than images of form documents, due to the high level of noise in the documents themselves. Furthermore, languages also influence recognition, as they directly impact the charset used by the optical model. Thus, we evaluated the applicability of each data augmentation approach for text images across different datasets.

Table 3 displays the different recognition levels found in the studies. In general, the most utilized recognition level was the word level, followed by the text-line level. This indicates that these two levels share similar challenges: data augmentation applied to words can expand to text lines, and data augmentation applied to text lines can contract to words. The other levels, characters and paragraphs, were the least used among the studies.

Table 3 Recognition levels utilized in the studies

Languages are intrinsic to the datasets, and the selected studies show good diversity in this regard, with the English language as the starting point for nearly all studies. The dataset of the “Institut für Informatik und Angewandte Mathematik” (IAM) [59] was utilized in the works of [27,28,29,30,31, 34, 36,37,38, 40, 42, 44, 45, 48, 49, 51, 58] and is the best-known dataset in the study area, comprising 1,539 pages written by 657 different writers. Another dataset is the “Computer Vision Lab” (CVL-Database) [60], which is designed for writer retrieval and identification and includes 311 different writers. It was employed in the works of [27, 32, 35, 37, 40, 49]. The historical Bentham dataset [61], which consists of images of letters by the English philosopher Jeremy Bentham (1748–1832), was utilized in the work of [47]. Furthermore, the English subset of the Maurdor dataset [62] was used in [34] and contains heterogeneous images of different types of documents. Finally, the “GoodNotes Handwriting Kollection” (GNHK) dataset [63], which comprises unrestricted camera-captured images of English handwritten text from various regions, characterized by diverse styles and increased noise, was used in the work of [38].

French was the second most commonly used language in the selected studies. However, only two datasets were available in this language. The “Reconnaissance et Indexation de données Manuscrites et de fac similÉS” (RIMES) dataset [64], which comprises handwritten letters from various writers, is considered simple due to its image quality and uniformity. The RIMES dataset was used in the studies [27, 29,30,31, 34, 37, 40, 42, 46, 49]. The second dataset was the French subset of Maurdor, which was utilized in the study [34].

Regarding the German language, the CVL-Database [60] was the most frequently used dataset [27, 32, 35, 40, 49], and the READ dataset [65], containing historical German documents, was utilized in the study of [30]. Finally, the Bullinger dataset, presented and utilized in the study [47], comprises historical letters written in German sent to the reformer Heinrich Bullinger (1504–1575).

The Arabic language was also significantly utilized, offering substantial variation across datasets. A first remarkable dataset in Arabic is the Maurdor [62] subset, which has approximately 13,000 text-line samples and was used in the study of [34]. OpenHaRT [66], boasting a large database of approximately 710,000 images, was utilized in the study of [46]. The dataset from the “Institute for Communications Technology/Ecole Nationale d’Ingénieurs de Tunis” (IFN/ENIT) [67], employed in the studies conducted by [35, 56, 57], offers character and word recognition capabilities and encompasses around 411 different writers. The “Arabic Handwriting Data Base” (AHDB) [68], used in the studies conducted by [56, 57], consists of characters and words derived from numerical values and bank check filling. Finally, the “Multilingual Automatic Document Classification Analysis and Translation” (MADCAT) database [69,70,71] was used by the work [51], and consists of handwritten Arabic documents scanned at high resolution, totaling 750,000 images of segmented lines.

The “Handwritten Kazakh and Russian” (HKR) dataset [72], representing Kazakh and Russian languages, has been utilized in the studies by [35, 41, 43]. Furthermore, the Chinese language is also represented by the dataset from the “Chinese Academy of Sciences’ Institute of Automation” (CASIA) [73], that was used in the works of [34, 37, 50, 53]. CASIA offers online and offline recognition versions. In the offline version, the studies worked with almost 1.4 million labeled characters.

Some datasets were less utilized, either because they were a subset of another dataset or because they were proposed for a specific competition. For the Spanish language, the “Spanish Numbers” dataset [74] was employed in the work of [31]. This dataset comprises handwritten numerals written by 30 different writers. For the Vietnamese language, a small dataset was introduced in the Cinnamon AI Marathon competition, “the Cinnamon Handwritten OCR for Vietnamese Address Challenge” [75], which contains handwritten address images and was utilized in the work of [48]. Another Vietnamese dataset is the “Vietnamese Online Handwritten Text Recognition” (VNonDB) [76], used by [51], which is an online handwritten Vietnamese dataset released as a challenge for ICFHR2018 and converted into an offline version, comprising 100,000 images of word-level segmentations. In the case of Latin, a subset of the Bullinger dataset [47] was used in the work of [47]. The languages of Bangla and Mongolian are represented via the “CMATERdb” [77] and “Mongolian-Database” [35] datasets, respectively, as utilized in the studies by [39] and [35]. For the Sundanese language, the historical “Sundanese Palm Leaf Manuscript” (HSPLM) dataset [78] was used by [54] for character recognition. For the Urdu language, the work of [52] utilized two databases: the “Center of Language Engineering” (CLE) [79] and UCOM [80]. The CLE database contains 18,000 Urdu ligatures in Unicode format, while the UCOM database comprises 48 distinct lines of Urdu text authored by 100 different writers. For the Japanese language, the simulated “Japanese Handwriting Dataset” (JHD) [32] was adopted in the work of [32], along with the “ETL Character Database” (ETL) [81] in the work of [55]. These last two datasets contain handwritten character images.

Finally, Table 4 shows the distribution of all selected studies according to datasets and respective languages.

Table 4 Distribution of the selected studies across the datasets and respective languages (alphabetical order)

Data Augmentation

In this subsection, we delve into data augmentation approaches, which have been classified into three main categories in the offline handwriting recognition research field. The first, Digital Image Processing, comprises traditional methods with lower computational requirements and a stand-alone functionality. The second, Transfer Learning, encompasses strategies that utilize pre-existing datasets to enhance the training of optical models. Finally, Deep Learning refers to advanced techniques using deep learning architectures to augment data through image synthesis.

Additionally, it is noteworthy that some studies used an end-to-end solution, creating their own data augmentation methods and testing them with standard handwriting recognition metrics. Others used metrics typical of image generation. In either case, these methods can be beneficial for the offline handwritten text recognition research field.

Digital Image Processing

Digital Image Processing (DIP) involves algorithms that apply transformations to digital images. DIP is a traditional approach to data augmentation in the field of offline handwritten text recognition research, and its algorithms are widely used with optical models [30, 34, 35, 42, 50, 54, 55, 57]. They allow randomness to be applied to text image transformations, which in turn helps to prolong the training process of optical models and prevent premature overfitting.

Initially, Shen and Messina [50] explored character-level segmentation to evaluate various strategies for generating synthetic text lines from isolated characters. These strategies range from simple processing, such as placing characters one after the other, to more complex processing, such as using the coordinates of characters in annotated lines to create images of text lines with a more realistic appearance. A key strength of this approach was that generating full pages rather than single lines led to more realistic images, preserving the placement and relative positioning of characters in text lines. However, the strategy still encounters challenges with deformations caused by varying height and width ratios of characters, suggesting a need for refinement in the page synthesis methodology. Subsequently, an optical model was trained using a balanced combination of synthetic and real images, which contributed to a relative improvement of 10.4% on the CASIA dataset. This underscores the potential of augmenting the training data with synthetic images.

Wigington et al. [30] noted that Shen and Messina’s [50] proposal is promising, but highlighted its dependency on a character-level dataset to be effective. Thus, they proposed new methods for image normalization and deformation. The suggested normalization method is adaptive, accommodating variations in handwriting scale, which consequently improves the optical model’s tolerance to writing differences. Further, they introduced a distortion grid implementing random deformations to apply slight scale and inclination variations, character by character, within each word. This, however, might be computationally intensive and may require more optimization. Remarkably, they achieved Character Error Rates (CER—the lower, the better) of 3.0%, 1.4%, and 5.0% on the IAM, RIMES, and READ datasets.
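
As a rough illustration of a distortion grid, the sketch below warps a grayscale text image using a coarse grid of random offsets upsampled to a dense flow field. This is an illustrative reconstruction in the spirit of the technique, not the original implementation; the grid size and offset magnitude are assumed parameters.

```python
import cv2
import numpy as np

def random_grid_distortion(img: np.ndarray, grid: int = 8,
                           sigma: float = 1.5) -> np.ndarray:
    """Warp a grayscale text image with a coarse grid of random pixel
    offsets; parameters are illustrative, not from any reviewed study."""
    h, w = img.shape[:2]
    # Random offsets at coarse grid points, upsampled to per-pixel flow
    dx = cv2.resize(np.random.uniform(-sigma, sigma,
                                      (grid, grid)).astype(np.float32), (w, h))
    dy = cv2.resize(np.random.uniform(-sigma, sigma,
                                      (grid, grid)).astype(np.float32), (w, h))
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    return cv2.remap(img, xs + dx, ys + dy, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)
```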

Hayashi et al. [55] presented another approach for Kanji character recognition, based on a novel data augmentation technique involving statistical character structure models. The goal was not only to generate Kanji character images of diverse cursive writing styles, but also to do so using a unique probability distribution of character strokes that was immune to the influence of the original image. However, complete control over the generated characters’ structure remained a challenge, causing instability in the character images. Despite this, the approach contributed to a notable Character Accuracy Rate (CAR—the higher, the better) of approximately 93.1% on the ETL-9B dataset.

Using manifold mixup as a basis, Moysset and Messina [34] proposed a new training strategy for offline handwritten text recognition systems. This strategy involves merging two input images or their corresponding feature maps and serves as a regularizer in the optical model training. The study did not compare the technique directly with other advanced data augmentation methods, leaving a gap in understanding its relative performance. The strategy also presented potential implementation complexities due to the need to adapt it to varying image sizes. Overall, they achieved CERs of 23.9%, 3.3%, 4.6%, 8.9%, 14.8%, and 10.5% on the CASIA, RIMES, IAM, and Maurdor French, English, and Arabic subsets, respectively.

Next, Luo et al. [42] proposed an adaptive data augmentation method for optical model training. This approach adaptively adjusts transformation functions based on the model’s learning progress, thereby gradually increasing the difficulty of images. The method demonstrated broad applicability, enhancing text recognition performance across diverse settings, as evidenced by the achieved CER of 2.4% and 5.1% for the IAM and RIMES datasets, respectively. On the other hand, the use of custom fiducial points and joint learning may add to the complexity of implementation, potentially making it difficult to use in general. The work is accessible in a public repository (Footnote 7).

Eltay et al. [57] proposed a data augmentation method based on the frequency distribution of characters across the dataset. In this way, the method gives more weight to less frequent characters in a word, aiming to balance the character distribution across the dataset. While this approach effectively manages class imbalances, it does lean heavily on the character occurrence probabilities, which might make it less adaptable to datasets with different character distributions. Furthermore, the approach, which primarily solves class imbalance, might have limited effectiveness in situations where imbalance is not a key issue. Using the Word Accuracy Rate (WAR—the higher, the better) metric, the method achieved 99.0%, 95.1%, and 93.6% for the abc-d, abcd-e, and abcde-f subsets of the IFN/ENIT dataset, respectively. For the AHDB dataset, they achieved 98.1%.
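
The following sketch captures the spirit of this idea: words containing rare characters receive larger sampling weights for augmentation. The inverse-frequency weighting and the toy word list are illustrative assumptions, not the exact formulation of Eltay et al.

```python
import random
from collections import Counter

def word_sampling_weights(words: list[str]) -> list[float]:
    """Give higher augmentation weight to words containing rare characters,
    so oversampling them rebalances the character distribution."""
    char_counts = Counter(c for word in words for c in word)
    total = sum(char_counts.values())
    # A word's weight is driven by its rarest character (inverse frequency)
    return [max(total / char_counts[c] for c in word) for word in words]

# Toy example: "qaf" contains rarer characters, so it is sampled more often
words = ["alef", "baa", "baa", "qaf"]
weights = word_sampling_weights(words)
augmented_pool = random.choices(words, weights=weights, k=100)
```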

Meanwhile, Hidayat et al. [54] addressed the challenge of limited and historical data samples in the HSPLM dataset, using data augmentation techniques. They did not just use geometric transformations, but also added noise to the image background, and adjusted brightness to generate new samples, enhancing the variety of their data. Their approach, however, was not just about generating more data; they also carefully balanced the data at the character level. It is worth noting that while these methods showed promising results with a CAR of 97.4%, they were specifically tailored to the ancient Sundanese characters of this dataset. It is unclear how these techniques would fare with other languages or character sets. Moreover, the paper did not touch on potential downsides or limitations of their augmentation methods, such as the possibility of overfitting with too much augmentation.

Finally, Chen et al. [35] presented a rule-based handwritten word augmentation method at the script level. The method initially divides the handwritten word into curve components, applies deformations, and then joins the components back together. The authors put their approach to the test, showing that it outperforms traditional augmentation methods in experiments. Using the WAR metric, they achieved 30.5%, 81.5%, 73.0%, and 72.6% for the Mongolian-Database, CVL-Database, IFN/ENIT, and HKR datasets, respectively. However, this method has some potential drawbacks. For one, it involves a more complex process than traditional methods, which can make its use difficult. Another challenge is that it relies on prior knowledge of the languages being used. The work is available in a public repository (Footnote 8).

Transfer Learning

Transfer Learning proposals in the field of offline handwritten text recognition research involve storing the knowledge an optical model acquires on one dataset and applying it to recognize samples from another dataset. Thus, two or more sets of document images with good similarity in writing pattern can benefit from training a joint optical model. Even so, this approach is unusual and has had little exploration as a data augmentation method; only three selected studies used it [40, 52, 55].
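
A minimal sketch of this strategy, reusing the illustrative CNNBLSTM model defined earlier (the checkpoint path and charset sizes are hypothetical): the convolutional feature extractor trained on a source dataset is frozen, while the output layer is re-dimensioned and fine-tuned on the target dataset.

```python
import torch
import torch.nn as nn

# Load weights pre-trained on a source dataset (hypothetical checkpoint)
model = CNNBLSTM(num_classes=80)
model.load_state_dict(torch.load("source_dataset_weights.pt"))

# Freeze the low-level visual features learned on the source dataset
for p in model.cnn.parameters():
    p.requires_grad = False

# Replace the output head to match the target dataset's charset
target_charset_size = 60  # hypothetical
model.fc = nn.Linear(model.fc.in_features, target_charset_size + 1)

# Fine-tune only the trainable parameters (BLSTM and new head) with CTC loss
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```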

In this context, Hayashi et al. [55] presented an additional feature of their Digital Image Processing (DIP) method by conducting a Transfer Learning experiment on the ETL-9B dataset, which they divided into three parts. In their comparisons of different training approaches, they found that sharing knowledge between subsets resulted in a modest but noteworthy improvement in character recognition. Specifically, the Character Accuracy Rate (CAR) increased by 1%, achieving an overall score of 94.5%.

Burdett et al. [40] proposed a combination of Transfer Learning and Active Learning as a solution for offline handwritten text recognition. In their work, the authors designed a training pipeline that leveraged pre-trained optical models within an Active Learning framework, thereby enhancing the learning outcomes. They tested their approach on IAM, RIMES, and CVL-Database datasets, and achieved CER values of 4.2%, 4.3%, and 4.8%, respectively, demonstrating the effectiveness of their method.

Recently, Memon et al. [52] developed a handwriting generation model trained initially on ligature images and later fine-tuned via transfer learning from the CLE database to the UCOM database. Their results showed significant progress, achieving a Fréchet Inception Distance (FID—the lower, the better) score of 38.03, a Geometry Score (GS—the lower, the better) of 8.81 × 10⁻⁴, and a recognition accuracy of 72.6%. The authors also highlighted the promise of transfer learning in handwriting tasks, especially when training data are limited, suggesting potential improvements when blending real and rendered handwriting data.

Deep Learning

Deep Learning is a machine learning technique based on multi-layered neural networks, loosely inspired by how humans acquire certain types of knowledge. Nowadays, it enables optical models to achieve good results, which is why it is the most widely used approach [27,28,29, 31,32,33, 36,37,38,39, 41, 43,44,45,46,47,48,49, 51,52,53, 56, 58].

In this context, Alonso et al. [46] proposed using Generative Adversarial Networks (GANs) to generate synthetic images of handwritten words and integrate an optical model into the architecture. However, despite their innovative approach, there were still some artifacts visible in the generated images, indicating that the image quality might need further improvement. Unlike other works that focused on reducing error rates, the authors aimed to measure the improvement obtained through the synthetic text images of the generative model. To this end, the authors used the Fréchet Inception Distance (FID—the lower, the better) and Geometry Score (GS—the lower, the better) metrics, achieving 23.94 and 8.58 × 10⁻⁴, respectively. The FID is the current standard metric for assessing the quality of generative models: it compares the statistics of two distributions (generated and real images) by calculating the distance between them. The GS metric, in turn, compares the geometrical properties of the underlying data manifold with those of the generated data. Together, these metrics provide both qualitative and quantitative measures for evaluating a GAN’s performance. Additionally, their model achieved a Word Error Rate (WER—the lower, the better) of 11.9% on the RIMES dataset.
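
For reference, the standard FID computation over two sets of Inception features can be sketched as follows; extracting the features with an Inception network is assumed to have been done beforehand.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet Inception Distance between two feature sets (rows = images):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2))."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(s1 @ s2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerics
    return float(np.sum((mu1 - mu2) ** 2) +
                 np.trace(s1 + s2 - 2 * covmean))
```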

Based on this same idea, Fogel et al. [27] introduced the ScrabbleGAN model to generate synthetic images of handwritten words. This generative model not only generates a wide range of images, but it also has the ability to adapt to new styles, enhancing its versatility. One of the main issues is that it assumes all characters have the same width, limiting the diversity and realism of the generated images. Furthermore, while the model can create diverse styles, there is a lack of finer control over text style parameters. The ScrabbleGAN model achieved an FID of 23.78, GS of 7.60 × 10⁻⁴, and Inception Score (IS—the higher, the better) of 1.33. The model also delivered WER values of 11.3%, 23.61%, and 22.9% on the RIMES, IAM, and CVL-Database datasets, respectively. The work is available in a public repository (Footnote 9).

Following this research line, Kang et al. [33] presented the GANwriting model, which also generates synthetic images of handwritten words. This model generates realistic images and can mimic specific writing styles, allowing it to create different handwritten styles for the same text content. However, the model, a Sequence-to-Sequence (Seq2Seq) one [82], has the limitation that it can only synthesize short words, which can reduce its flexibility and utility. Nevertheless, this innovative model achieved an FID score of 125.23 and an IS score of 1.33, demonstrating its ability to generate realistic text images that resemble those in the IAM dataset. This work is also available in a public repository (Footnote 10).

Bhunia et al. [45] introduced a new approach called Handwriting Transformers (HWT) that creates synthetic images of handwritten text using the Transformer model [83]. The HWT captures long- and short-range contextual relationships within the writing style sample through a self-attention mechanism [83]. Unlike other models, HWT can work with text of any length and any style, giving it a lot of flexibility. However, it is also complex and computationally expensive. Even with these challenges, HWT performed quite well, achieving FID of 19.40, GS of 1.01 × 10⁻², and IS of 1.36 on the IAM dataset. The work is available in a public repository (Footnote 11).

Liu et al. [29] developed the HTG-GAN model, which can synthesize text images of arbitrary length. The authors redefined the structural relationship between characters in a sequence by breaking the bond between style and content, which allows generating images with new styles and chosen content. However, HTG-GAN has difficulty with languages with many independent characters, such as Chinese or Japanese, because it uses an encoding strategy based on the character’s place in the alphabet, which does not suit these languages. The generative model achieved an FID of 12.18 and a GS of 2.23 × 10⁻³. When used for handwriting recognition, they achieved WERs of 10.2% and 20.5% on the RIMES and IAM datasets, respectively.

Huu et al. [48] developed the Multilingual-GAN model for synthesizing text images. This new model is distinguished by its ability to work efficiently in multiple languages without additional training. Moreover, it is capable of generating diverse character styles, which enhances the versatility of the output. An aspect of their approach is the application of perceptual loss, which ensures content consistency between the input and the generated images. However, it is not without shortcomings. The model currently yields results that may exhibit blur and insufficient stroke precision. Additionally, the generated images can contain artifacts, affecting the overall realism. Despite these limitations, the authors explored and emphasized the importance of both adversarial and perceptual losses for producing realistic handwritten images. The study is publicly available in a repository (Footnote 12).

Zdenek and Nakayama [32] proposed the JokerGAN, a new GAN architecture for offline handwritten text recognition. The model stands out due to its ability to use character sequences of variable lengths as conditional input, making it flexible and adaptable. It is also memory efficient, remaining largely unaffected by the size of the character set. This makes it possible to handle languages with a large number of characters, such as Japanese and Latin, simultaneously. An innovative feature of the model is its awareness of the vertical alignment of characters, which enhances the quality of generated handwritten text. However, the study does not delve into other computational costs such as training time, which could be a significant factor for large datasets. Despite these considerations, the model managed to surpass state-of-the-art models, achieving an FID of 9.18.

To enhance the recognition of the Arabic language, Eltay et al. [56] combined their previous work on adaptive algorithms with a GAN model. Their method managed the inherent issue of class imbalance in text data, a prevalent concern in this field. However, it is worth noting that the current work is restricted to generating individual words, not entire lines of text. Moreover, their efforts resulted in a WAR of 97.2%, 95.9%, and 93.8% for abc-d, abcd-e, and abcde-f subsets of the IFN/ENIT dataset. They also achieved a notable 99.3% for the AHDB dataset.

In their subsequent research, Kang et al. [31] showed that employing realistic synthetic texts during training is advantageous for enhancing the performance of handwritten text recognition. The authors replaced the Seq2Seq model [82], a recurrent neural network (RNN), with the Transformer model [83], which is notable for its self-attention mechanism. This change enabled the generation of images with longer lines of text. However, the approach has limitations when handling special characters like accents, making it less effective for certain languages. Furthermore, to adapt to new handwriting styles, the model needs access to unlabeled text-line images, which could pose challenges in some situations. They achieved a CER of 8.62% and a WER of 26.69% on the IAM dataset. Similarly, on the RIMES dataset, they attained a CER of 6.45% and a WER of 19.56%.

Luo et al. [49] improved their previous research by proposing the SLOGAN model, which synthesizes handwritten text images of arbitrary length. In this study, they synthesized writing data by parameterizing the style and controlling the parameters to generate new cursive writing styles. However, the system relies heavily on identifying the writer from the original images, which could limit its ability to cope with entirely new styles. While the model can create new words or sentences, it might have difficulty with particularly rare or complex ones. The generative model achieved an FID of 12.06 and a GS of 5.59 × 10⁻⁴. The optical model reached a CER of 3.4%, 5.9%, and 14.1% on the RIMES, IAM, and CVL-Database datasets, respectively.

In their research, Spoto et al. [47] employed GANs to facilitate the recognition of handwriting in historical documents. This was achieved through the integration of authentic and synthetically generated handwriting samples. The effort was considerably successful, leading to a significant reduction in character error rate (CER) ranging from 3% to a notable 60%. Nonetheless, this study had its constraints. The dependency on large amounts of training data could pose difficulties with smaller datasets. Furthermore, the synthetic samples, while reflecting the targeted style, lacked the inherent variability of natural handwriting.

Gan et al. [28] proposed HiGAN+, a novel generative model based on disentangled representations. HiGAN+ enables the synthesis of realistic handwritten text images conditioned on arbitrary textual content and diverse cursive writing styles, allowing for the generation of paragraphs with different styles. However, human handwriting can be highly detailed and intricate, posing challenges for HiGAN+ in synthesizing text that captures all these intricacies. Nevertheless, the model achieved an FID of 9.65 and an IS of 1.41 on the IAM dataset. In addition, the research is publicly available in a repository (Footnote 13).

Recently, Kudaibergen and Hamada [41] focused on Russian handwritten text recognition. They employed GANs and used a model trained on synthetic data generated by ScrabbleGAN [27], resulting in a significant improvement in the optical model’s performance. However, it is essential to note that the study was limited by its exploration of only one GAN architecture and a relatively low achieved accuracy. Moreover, the issue of optimal data size for training was raised but not fully investigated. Nonetheless, the experiment yielded promising results on the HKR dataset, with a WAR increase of up to 24.1% when combining different types of synthetic data.

Yeleussinov et al. [43] proposed a novel use of GANs for handwriting recognition. The GAN model consists of a handwriting word image generator and an image quality discriminator. In this way, the model is trained with multiple losses to learn the structural properties of texts and produce high-quality images of handwritten text. The study reached a CER of 11.15% and a WER of 25.65% on the HKR dataset.

Das et al. [39] developed a GAN model to create synthetic handwritten Bangla compound characters. Their improved model, inspired by the Auxiliary Classifier GAN (AC-GAN), demonstrated an enhanced FID score compared to the original AC-GAN, reaching 7.81 on the CMATERdb dataset. However, the study lacks a comparative analysis on other state-of-the-art datasets, which would provide a more comprehensive view of the model’s performance and position within the research field. The study is publicly available in a repository (Footnote 14).

Wang et al. [58] proposed the AFFGANwriting model, which employs a VGG19-based style encoder to extract multiscale handwriting features and generate realistic handwriting images. The approach captures both global and local characteristics of handwriting, reaching an FID score of 28.65 on the IAM dataset.

Recently, Gui et al. [53] proposed a Denoising Diffusion Probabilistic Model (DDPM) that transforms font-library-based Chinese character images into handwritten samples. When tested on the CASIA dataset, the model trained with synthesized samples showed recognition accuracy comparable to training with real samples. In general, the DDPM-based approach achieved 98.6% accuracy, outperforming other methods even when using fewer synthesized samples. On the other hand, the authors highlighted potential improvements, such as refining synthesis quality and reducing the extended DDPM training time, suggesting room for further exploration.
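
To make the diffusion idea concrete, the sketch below implements the standard closed-form forward (noising) step that a DDPM is trained to invert; the linear noise schedule is a common textbook choice, not the configuration used by Gui et al.

```python
import torch

# Linear noise schedule over T steps (illustrative, standard DDPM setup)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0: torch.Tensor, t: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample x_t ~ q(x_t | x_0) in closed form; a denoising network is
    then trained to predict the added noise from (x_t, t)."""
    noise = torch.randn_like(x0)
    xt = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise
    return xt, noise

# A normalized (hypothetical) font-rendered character image in [-1, 1]
x0 = torch.rand(1, 1, 64, 64) * 2 - 1
xt, eps = q_sample(x0, t=500)
# Typical training objective: mse_loss(model(xt, t), eps)
```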

Memon et al. [52] discussed the challenges of recognizing multiple cursive scripts due to limited labeled training data. The work proposed a content-controlled training approach for Urdu handwriting generation combined with a pre-trained recognizer loss. This model, trained on diverse ligature images and further fine-tuned through transfer learning, is distinct from the predominant GAN-focused research. It reached an FID score of 69.01 and an accuracy of 77% on the CLE dataset, and an FID score of 23.24 and an accuracy of 69.7% on the UCOM dataset.

Chang et al. [51] presented a method using GANs to generate handwritten content across different languages, with the goal of enhancing handwriting recognition in low-resource contexts. They reported FID scores for the VNonDB Vietnamese dataset as 27.46, 77.10, and 142.08 for printed, crosslingual, and semi-supervised GANs, respectively. For the MADCAT Arabic dataset, the scores were 23.28, 70.56, and 111.74. Moreover, they observed a notable improvement of up to 5 percentage points when doubling the data for augmentation. Nonetheless, the study might benefit from a more extensive evaluation across languages and a comparison with other existing approaches.

Nikolaidou et al. [44] introduced a method using a conditional Latent Diffusion Model to generate realistic word image samples across various writer styles. This approach leverages class index styles and text content prompts, eliminating the need for adversarial training, writer recognition, or handwriting recognition. On the other hand, the method requires a large amount of training data to learn the distribution of different writer styles and is computationally expensive. The evaluation on the IAM dataset reached an FID score of 22.74, a CER of 4.67%, and a WER of 13.28%. The study is publicly available in a repository (Footnote 15).

Zhu et al. [37] presented the Conditional Text Image Generation with Diffusion Models (CTIG-DM) for generating handwritten text images. The model effectively synthesizes diverse text images, adaptable to specifics like content, font, and background. Although promising for real-world scenarios, including scene text and diverse handwritten scripts, CTIG-DM demands substantial training data and computational power, and may face challenges with unseen text styles. In their evaluations, they reported an FID score of 25.52 on the IAM dataset. Furthermore, when trained on the IAM dataset, the recognition model achieved CER and WER scores of 10.89% and 26.24%, respectively, on the CVL-Database.

Pippi et al. [36] introduced the Visual Archetypes-based Transformer (VATr), a model designed for generating synthetic handwritten text, with an emphasis on capturing writer-specific styles, especially when faced with unseen styles or rare characters. The unique approach of VATr uses standard GNU Unifont glyphs to represent textual content, making it efficient in handling characters seen less often during training. Furthermore, through pre-training on a large synthetic dataset, the model becomes adept at focusing on writing styles without getting distracted by backgrounds or ink textures. Experimental results were promising, with an FID score of 17.79 on the IAM dataset. The study is publicly available in a repository (Footnote 16).

Finally, Zdenek and Nakayama [38] presented JokerGAN++, an extension of their previous work. In this study, the model uses a Vision Transformer (ViT)-based style encoder to generate handwritten text images, which can replicate specific handwriting styles from reference images and produce random styles as well. A unique feature is its ability to provide character-specific style encodings using the target character sequence. The authors registered FID scores of 2.13 on the IAM and 5.99 on the GNHK datasets. Furthermore, a WER of 25% was recorded on the IAM dataset when training with an additional 100,000 synthetic samples.

Discussion

The papers identified in this systematic review satisfied our search criteria, showcasing a range of approaches, methods, and applications in the field of offline handwritten text recognition. Consequently, we were able to identify several research gaps, which were not adequately explored in the presented works. In the remainder of this section, we discuss specific and relevant topics, and provide answers to the research questions defined.

RQ1: What are the most commonly used recognition levels for data augmentation applied to offline handwritten text recognition?—Through our analysis of the selected studies, we observed that approximately 51.3% employed word-level recognition for data augmentation in the research field of offline handwriting recognition [27, 29, 30, 32, 33, 35,36,37,38, 40, 42,43,44, 46, 49, 51, 52, 56,57,58]. This reflects the broad applicability of word-focused data augmentation approaches.

Currently, word sequencing into lines is still considered a trend in generative models, representing an advancement in the field. However, the challenges of generating line structures are associated with the limited availability of data for training deep learning models and the computational costs involved. Although approximately 30.8% of the reviewed studies have focused on line-level applications [28, 30,31,32, 34, 38, 41, 45, 47,48,49,50], this presents significant opportunities for the development of handwriting recognition systems.

This leads us to reflect on the character and paragraph scenarios, often unexplored, appearing in 12.8% and 5.1% of the reviewed studies, respectively. Paragraph scenarios involve greater complexity than line structures, considering the sequencing of words and then the stacking of lines; this approach was less explored due to its limited applicability and high cost. On the other hand, character scenarios mainly correspond to applications in glyph-based languages, such as Chinese and Japanese. Finally, Fig. 6 shows the proportion of recognition levels used by the studies.

Fig. 6: Proportion of recognition levels used by studies. Each work may have more than one type of recognition associated

RQ2: What are the most commonly used datasets for data augmentation applied to offline handwritten text recognition?—In general, the reviewed papers prioritized the IAM and RIMES datasets, used in approximately 32.8% and 17.2% of the papers, respectively [27,28,29,30,31, 34, 36,37,38, 40, 42, 44,45,46, 48, 49, 51, 58]. These datasets hold significant prominence in the field and serve as well-established benchmarks. Following closely, the CVL-Database is another popular multi-writer benchmark, while CASIA is valuable for research involving glyph-based languages; they are the second most frequently employed, appearing in about 10.3% and 6.9% of the papers [27, 32, 34, 35, 37, 40, 49, 50, 53]. The third group of frequently used datasets comprises IFN/ENIT and HKR, each appearing in 5.2% of the papers [35, 41, 43, 56, 57]. Finally, the remaining datasets each appeared in less than 5% of the papers; these correspond to studies focused on a particular dataset, often exploring a specific language or even introducing a new dataset. Fig. 7 shows the distribution of dataset usage among the reviewed studies.

Fig. 7: Proportion of datasets used by studies. Each work may have more than one dataset associated.

RQ3: What is the current state of the data augmentation research field applied to offline handwritten text recognition?—We found that approximately 23.5% of the studies applied Digital Image Processing (DIP) techniques in addition to the optical model [30, 34, 35, 42, 50, 54, 55, 57]. This approach offers great flexibility across datasets while maintaining low computational cost. However, these methods are limited by the text structure to which they are applied: the higher the recognition level, the harder it is for transformation functions to generate new images without corrupting the text's content or structure.
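To make this concrete, the following is a minimal sketch of such preprocessing-time transformations, here written with torchvision (our parameter choices; ElasticTransform assumes torchvision 0.13 or later). Keeping the distortions small reflects exactly the constraint noted above for higher recognition levels:

```python
# Minimal sketch of random DIP-style augmentation for handwritten text
# images; parameter values are illustrative, not from any reviewed work.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=2, translate=(0.02, 0.05),
                   scale=(0.9, 1.1), shear=5, fill=255),  # mild geometric jitter
    T.ElasticTransform(alpha=30.0, sigma=6.0, fill=255),   # stroke-level warping
    T.ColorJitter(brightness=0.2, contrast=0.2),           # ink/paper variation
])

# new_image = augment(pil_line_image)  # applied per sample at training time
```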

Transfer learning was another, less explored line of work (8.8%) [40, 52, 55], since data augmentation is not its primary purpose. The challenge with this approach is to leverage the optical model's previously acquired knowledge when retraining it on another dataset. It initially proved effective in simple scenarios, or at least when the source and target text image patterns were similar, but it became difficult to apply under restricted datasets, requiring a larger volume of data to improve performance. On the other hand, a recent study reported improved results when transfer learning was applied to Urdu-language datasets [52], and its authors highlighted the potential of transfer learning in handwriting tasks, especially with limited training data.
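As an illustration of the usual recipe, a hedged PyTorch sketch follows: pre-trained weights are loaded, the convolutional feature extractor is frozen, and only the sequence layers are fine-tuned on the small target dataset. The model below is a toy stand-in, not any reviewed architecture, and the checkpoint path is hypothetical:

```python
# Toy CNN-BLSTM optical model used only to illustrate the fine-tuning
# recipe; real systems are considerably deeper.
import torch
import torch.nn as nn

class TinyOpticalModel(nn.Module):
    def __init__(self, n_classes=80):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1),
                                 nn.ReLU(), nn.MaxPool2d(2))
        self.blstm = nn.LSTM(32 * 16, 128, bidirectional=True, batch_first=True)
        self.head = nn.Linear(256, n_classes)

    def forward(self, x):                       # x: (B, 1, 32, W)
        f = self.cnn(x)                         # (B, 32, 16, W/2)
        f = f.permute(0, 3, 1, 2).flatten(2)    # (B, W/2, 512) time-major features
        out, _ = self.blstm(f)
        return self.head(out)                   # per-timestep class scores

model = TinyOpticalModel()
# model.load_state_dict(torch.load("source_weights.pt"))  # hypothetical checkpoint

for p in model.cnn.parameters():                # freeze generic visual features
    p.requires_grad = False

optimizer = torch.optim.AdamW((p for p in model.parameters()
                               if p.requires_grad), lr=1e-4)
```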

Finally, the most widely used approach among the reviewed studies, accounting for roughly 67.6%, was applying deep learning to synthesize handwritten text images. Early generative models were limited in text length and cursive style and incurred high computational costs. Since then, they have improved significantly as text image generators and can now produce text images of arbitrary size, content, and cursive writing style [27,28,29, 31,32,33, 36,37,38,39, 41, 43,44,45,46,47,48,49, 51,52,53, 56, 58]. Fig. 8 shows the proportion of data augmentation approaches used by the studies.

Fig. 8: Proportion of data augmentation approaches used by studies. Each work may have more than one approach associated.

In the deep learning domain, we observed three types of models applied to synthesize handwriting images. Transformer models were among the least employed, appearing in 13.0% of the reviewed studies [31, 36, 46]; they were generally used by offline handwritten text recognition works to boost data augmentation, and we consider them an early approach within handwritten text synthesis research. Diffusion models, although only emerging in 2023, already account for 13.0% of the reviewed studies [37, 44, 53]; this rapid growth indicates their potential for future applications. Finally, Generative Adversarial Networks (GANs) were extensively explored and developed over time, representing 74.0% of the reviewed studies [27,28,29, 32, 33, 38, 39, 41, 43, 45, 47,48,49, 51, 52, 56, 58]. Across these works, we observed a trend toward more realistic synthesis of handwriting images together with a focus on reducing computational costs, which motivates further research in the field and refinement of its application. Fig. 9 shows the proportion of deep learning models applied to synthesizing handwriting images.

Fig. 9: Proportion of deep learning models applied to synthesize handwriting images in reviewed studies.
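To ground the discussion, the following is a hedged sketch of a single training step of a text-conditioned GAN of the kind these works build on. Generator and discriminator internals are deliberately abstract, and many of the reviewed systems additionally pass the generated image through a recognizer loss to keep the synthesized text readable:

```python
# Hedged sketch of one adversarial training step for text-conditioned
# handwriting synthesis; G and D are placeholder modules.
import torch
import torch.nn.functional as F

def gan_step(G, D, real_imgs, text_emb, opt_g, opt_d, z_dim=128):
    n = real_imgs.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)
    z = torch.randn(n, z_dim)
    fake = G(z, text_emb)                       # image conditioned on target text

    # Discriminator: separate real from generated under the same condition.
    d_loss = (F.binary_cross_entropy_with_logits(D(real_imgs, text_emb), ones) +
              F.binary_cross_entropy_with_logits(D(fake.detach(), text_emb), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: produce images the discriminator accepts as real.
    g_loss = F.binary_cross_entropy_with_logits(D(fake, text_emb), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```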

In general, DIP and transfer learning approaches are bound to the content of the dataset itself, either applying transformations to an existing image or reusing the knowledge an optical model learned on another dataset. Deep learning-based works, in contrast, synthesize handwritten text images from scratch using the cursive style learned from the dataset, a versatility that makes their application more comprehensive (Fig. 10).

Fig. 10: Examples of synthetic handwritten text images created by the reviewed works. a Visual comparison of the results obtained from the studies by Alonso et al. [46], Fogel et al. [27], Liu et al. [29], and Luo et al. [49]. b Handwritten words generated by Gan et al. [28] using reference-guided synthesis. c Images generated by Gui et al. [53] of Chinese character samples with different content and writer guidance scales. d Handwriting style interpolation in the work of Kang et al. [31].

RQ4: What are the current challenges in data augmentation applied to offline handwritten text recognition?—Our analysis focused on the challenges faced by studies of generative models, since DIP methods have already been extensively explored in offline handwriting recognition research. We identified three main gaps in the current literature: (i) low computational cost; (ii) integration of the synthesizer model with the optical model; and (iii) application to restricted datasets.

In this regard, computational cost has been a scarcely explored aspect. Only a few studies have examined the performance of the generator model, particularly when it is applied within an offline handwriting recognition system. This kind of analysis has appeared only in the latest studies, and even then in an isolated manner, that is, without considering the optical model.

The second gap concerns the integration of the generator model and the optical model into an end-to-end system. Little exploration has been done on continuous and adaptable integration between the two models, which often results in two independent workflows: the pipeline for learning and generating synthetic handwritten text images is executed first, and only then does the optical model make use of the generated data.
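A hedged sketch of what closing this gap could look like: drawing synthetic batches from the generator inside the recognizer's own CTC training loop, rather than in a separate, earlier pipeline. Here generator.sample_batch is a hypothetical interface, not an API from any reviewed work:

```python
# Hedged sketch of generator/optical-model coupling: synthetic samples are
# drawn on the fly during recognizer training instead of being pre-generated.
import torch

def train_epoch(optical_model, generator, real_loader, optimizer, synth_ratio=0.5):
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    for images, targets, target_lens in real_loader:
        if torch.rand(()) < synth_ratio:        # swap some batches for synthetic ones
            images, targets, target_lens = generator.sample_batch(images.size(0))
        log_probs = optical_model(images).log_softmax(-1).permute(1, 0, 2)  # (T, B, C)
        input_lens = torch.full((images.size(0),), log_probs.size(0),
                                dtype=torch.long)
        loss = ctc(log_probs, targets, input_lens, target_lens)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```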

Finally, the third gap is the applicability of the proposed models to restricted datasets, which offer only a limited volume of data and thus pose a challenge for handwriting recognition. Deep learning models struggle in such scenarios due to the absence of large-scale data samples. This is precisely the situation that stands to benefit most from data augmentation, yet it has been under-addressed.

Conclusion

Data augmentation is a topic with many nuances that vary according to the application domain. Furthermore, data augmentation techniques have the potential to be applied in various related fields, such as handwriting recognition, writer identification, keyword spotting, and more. Each of these areas, although sharing some similarities, has its own peculiarities and specific requirements. Thus, we presented a systematic literature review on data augmentation applied to offline handwritten text recognition. We consider the following our main contributions:

  • Scope definition of a systematic literature review on data augmentation applied to offline handwritten text recognition;

  • Exploration of the used datasets and recognition levels to synthesize handwriting images;

  • Analysis of data augmentation approaches and the synthesis of handwritten text images over the past decade in the offline handwritten text recognition research field;

  • Identification of current gaps and challenges in the literature, which led us to suggest future research directions to address them.

Initially, 976 papers were collected from five academic databases using keywords relevant to the research field. After a four-step exclusion process, 32 papers were selected and reviewed. Additionally, the quality evaluation scored the papers on a scale from 0 to 10 points, with the highest score obtained being 7.75 [27].

Through the selected works, we explored and described relevant aspects of each study. We mapped the datasets and levels of handwriting recognition most commonly used, and consequently, the most used languages as well. This allowed us to relate and analyze each proposed method within its specific application context.

Based on this study, we conclude that Digital Image Processing methods are practical and improve optical models during training. However, data augmentation through Generative Adversarial Networks is the new trend for realistic synthesis of handwritten text images. This approach has the potential to open new research directions, and its use alongside optical models is highly promising.

It should be emphasized that offline handwritten text recognition with a central focus on data augmentation is still a relatively new field. Nevertheless, we have observed growing activity in this research area in recent years, accompanied by significant progress in the application of Generative Adversarial Networks as generators of synthetic handwritten text images. This trend indicates the academic community's increasing interest in the benefits of combining these research lines.

In conclusion, one direction for future work involves low-volume datasets, where generating synthetic handwritten text images can benefit optical model training. Another relevant direction is the development of generative models integrated with optical models, following a self-supervised learning approach.