Handwritten Stenography Recognition and the LION Dataset

Purpose: In this paper, we establish a baseline for handwritten stenography recognition, using the novel LION dataset, and investigate the impact of including selected aspects of stenographic theory into the recognition process. We make the LION dataset publicly available with the aim of encouraging future research in handwritten stenography recognition. Methods: A state-of-the-art text recognition model is trained to establish a baseline. Stenographic domain knowledge is integrated by applying four different encoding methods that transform the target sequence into representations, which approximate selected aspects of the writing system. Results are further improved by integrating a pre-training scheme, based on synthetic data. Results: The baseline model achieves an average test character error rate (CER) of 29.81% and a word error rate (WER) of 55.14%. Test error rates are reduced significantly by combining stenography-specific target sequence encodings with pre-training and fine-tuning, yielding CERs in the range of 24.5% - 26% and WERs of 44.8% - 48.2%. Conclusion: The obtained results demonstrate the challenging nature of stenography recognition. Integrating stenography-specific knowledge, in conjunction with pre-training and fine-tuning on synthetic data, yields considerable improvements. Together with our precursor study on the subject, this is the first work to apply modern handwritten text recognition to stenography. The dataset and our code are publicly available via Zenodo.


Introduction
Stenography, or shorthand, is a method commonly used for speed writing, which manifests itself in many different languages and systems. Because it can be written rapidly, shorthand has traditionally been used by secretaries and reporters, for example in parliament and law courts, for recording testimonies and interviews, or for dictation in business correspondence. A stenographer typically develops a personal style that goes beyond the commonly observed handwriting variations, such as size and slant, and entails an individual bank of abbreviations and innovations. These allow the notetaker some privacy from the uninitiated.
Fig. 1 Excerpt from Melin's system. Top: selected characters, shortforms (av, det, och, jag), and an n-gram (ng); bottom: two examples, jonatan and nangijala, in handwritten stenography; colours visualise the extent of the respective character, as indicated in the transliteration below.
For Swedish author Astrid Lindgren (1907-2002), who wrote and edited all her literary fiction in the Swedish stenographic system of Melin, shorthand provided access to a private intellectual and creative space, Lindgren's own version of a "room of one's own" [1]. Since Lindgren was the editor and publisher of her own books [2], and did all her editing in her shorthand notepads, the revisions, deletions, and additions they display constitute the only first-hand source to the author's creative process [3,4]. Lindgren herself has described her stenographic notes as impossible to interpret. This myth, frequently reproduced in general Lindgren reception, has been dispelled through research within the ongoing digital humanities project "The Astrid Lindgren Code" (2020-2023) [5], where different and mixed methods are applied to approach Lindgren's shorthand [6,7].
In this work, we study Lindgren's manuscripts from a handwritten text recognition (HTR) perspective. Prior research in stenography recognition has primarily focused on the English stenographic systems Gregg's and Pitman's and has, so far, been limited to symbol and individual word recognition [8-12]. In contrast, we focus our experiments on line-level transliterations and present a first baseline, employing a state-of-the-art deep learning model [13]. We build upon this baseline by extending the training process with domain knowledge, primarily founded in visual aspects of the Melin system, as well as synthetically generated training lines.
All of our experiments are based on the novel LION dataset, which is published in conjunction with this work. This low-resource corpus consists of a selection of Lindgren's drafts, containing portions of her well-known novel for children, The Brothers Lionheart, as well as excerpts from selected other texts.
Our contributions in this work can be summarised as follows:
• the introduction of the novel, stenographic LION dataset (section 3)
• the establishment of a line-based stenography recognition baseline (section 4)
• the integration of stenographic domain knowledge into the training process (section 5)
• further improvement of results via pre-training and fine-tuning (section 6)
Furthermore, the LION dataset and our code are publicly available via Zenodo (cf. section 8).

Related Works
Our work is placed at the intersection of document image processing for stenography and handwritten text recognition. Below we present a summary of relevant, related works for both topics.

Document Image Analysis for Stenography
Early approaches in document image analysis for stenography date back to the 1980s, for example by Leedham and Downton [8]. The research in this field has, so far, been limited to the two main English shorthand systems, Pitman's and Gregg's. Considering the research that has been published within the past ten years, i.e. since 2012, we have identified the following four works. In 2012, Htwe et al. investigated the use of Bayesian networks as part of the recognition pipeline for Pitman's shorthand [9].
Zhai et al. proposed a pipeline to perform word-level recognition of Gregg's shorthand [10]. For this, they combined a convolutional neural network (CNN), as feature extractor, with a recurrent neural network (RNN) as sequence generator, and refined the generated hypotheses using a word retrieval module. In addition to their proposed pipeline, Zhai et al. released Gregg-1916, a word-level dataset that consists of 15 711 word images that were extracted from a printed Gregg's shorthand dictionary.
Montalbo and Barfeh employed Canny edge detection and a CNN to classify 100 commonly-used legal words and phrases, written in Pitman's stenography [11]. Following a similar line of research, Padilla et al. investigated the use of Inception-v3 to classify 135 legal terms, written in Gregg's shorthand [12].
Lastly, we recently published our precursor study, which uses the LION dataset to investigate the effect of commonly-used HTR data augmentations [7]. Besides our prior work, we are not aware of any literature that investigates the use of state-of-the-art text recognition techniques for any system of stenography.

Handwritten Text Recognition
Handwritten text recognition (HTR) approaches range from identifying individual symbols (e.g. [14]) to page-level recognition (e.g. [15]). In this work we focus on line-based recognition, because our dataset is annotated and segmented at this level. Deep learning-based, line-level text recognition can be further divided into three main categories, based on the employed approaches.
Secondly, sequence-to-sequence (seq2seq) methods use network configurations similar to the aforementioned approach but add an additional RNN to decode the output sequence. Approaches in this category have, for example, been proposed by Michael et al. [21] and Chowdhury et al. [22].
In this work, we focus on CTC-based models, as these have been shown to generally perform well, especially in low-resource settings [13,24,25], as is the case for the LION dataset.

Dataset
In this section, we introduce the novel LION dataset, which is the first of its kind in several regards. Firstly, it is the first dataset containing a portion of Astrid Lindgren's original drafts and handwriting. Secondly, it is the first to present text written in the Swedish stenography system Melin. Finally, it is the first publicly available dataset covering a substantial amount of handwritten lines in any kind of stenographic system.

Interdisciplinary Context
In the following, we contextualise the LION dataset as an object for digital humanities and literary studies, and briefly discuss to which extent the dataset can be considered representative for both Melin shorthand and shorthand in general. We present characteristics of Lindgren's vocabulary and style, and provide selected examples of challenges connected to inconsistencies in Lindgren's shorthand writing.

Mixed Methods
Preserved stenographic material can be of varying historical or cultural significance, but notable examples of canonised authors who have made use of shorthand in their writing process include Fyodor Dostoyevsky [27] and Charles Dickens [28,29]. Providing access to Astrid Lindgren's shorthand manuscripts is motivated by their status as prominent cultural heritage, and finds further relevance from the perspectives of children's literature, book and media history, and textual and genetic criticism. The latter means entering "the workshop" [30] of the writer and drawing attention to the labour, craftsmanship, and dynamics of the creative process. Usually it involves the process of organising and making accessible the documents that precede a book's publication, which is achieved by compiling and deciphering relevant documents, establishing a chronological order, and then transcribing and editing the texts [31]. This work is essential for the compilation of reliable editions or for supplementing important works of literature with annotations.
The making of children's books has specific features and poses specific questions that have not yet been systematically addressed in genetic-critical studies. The potential of such a focus lies in the possibility of a better understanding of how children's literature is created and what considerations dominate the decision-making process in terms of, for example, content, character development, construction of setting, or narrative voice and style [32]. From the point of view of children's literature and genetic criticism, drafts to The Brothers Lionheart are especially interesting due to the novel's seminal position in Lindgren's oeuvre, its radical content, the author's well-known difficulties in bringing the novel to its end, and the controversies upon its publication and reception [3,33]. From the perspectives of genetic criticism, HTR, and expert crowdsourcing, the limited but proportionally large amount of shorthand notepads containing drafts to the novel (55 in total) provides a sufficient amount of material/data whilst also enabling a sustainable crowdsourcing life-cycle [6]. By mixing the best features of all aforementioned methods, a more coherent, multifaceted whole is created.

Stylistic Overview of Literary Works
For Lindgren, stenography forged a link between vocalisation and writing which is likely to have favoured oral elements of her work in general [34]. Dialect, folk songs, psalms, and jokes as well as linguistic and onomatopoetic innovations are all recurrent elements of Lindgren's fiction. The LION dataset represents four different works by Lindgren which are written for different purposes and target groups, belong to different genres, and consequently also differ somewhat in style and vocabulary.
The vocabulary in the excerpt from the text On our grove is primarily characterised by the pastoral description of the Swedish grove which it contains, with specific names of flowers, trees, animals, and berries. The portion from Samuel August from Sevedstorp and Hanna in Hult is slightly more complex in style, as it includes allusions, quotes and song lyrics directed toward an older audience, referring for example to historical rural, oral, and religious tradition. The excerpt from Emil of Lönneberga consists of an adaptation of the novel for either screen or stage, and includes song lyrics ("Bomsicka bom"), directions and dialogue, occasionally written in dialect (for instance the use of "dä" and "di" instead of the standard Swedish spellings "det" and "de"). As for The Brothers Lionheart, the vocabulary and style are in line with Lindgren's often expressed credo: that authors of children's books should write in ways that children can easily understand and relate to. A guiding stylistic principle for Lindgren was to use "common words to say uncommon things" [4]. Consequently, the vocabulary used by Lindgren in this novel is relatively simple and straightforward. More unusual words include character names (e.g. the dragon Katla), fictional place names (e.g. Nangijala), and occasionally made-up words and compound words significant both for Lindgren and for the flexibility of the Swedish language. In Melin shorthand, the stenographic representation of compound words may vary depending on, for example, writeability or space, and exist both as joint and split forms (e.g. duvdrottningen/duv drottningen).
The phonetic and colloquial spelling of Melin and Lindgren's ideas of how to write for and about children are connected. Many words in narrator and protagonist Skorpan's vocabulary consist of phonetically simplified words, mirroring children's colloquial language. For example, word images such as "huvet" ("huvudet", "the head") or "nitti" ("nittio", "ninety") might be perceived as shorthand abbreviations, but have in fact been transposed into Lindgren's fiction in the subsequent phase of typing up.

Writing and transliterating the Melin system
The Melin stenographic system in which Lindgren wrote was the standard system taught in Sweden during the 20th century, and has consequently been widespread among secretaries, journalists, and clerical staff. Expert volunteer transliterators of The Astrid Lindgren Code are recruited from this group [6].
Developed by Olof W. Melin in 1890-1892, the Melin system is based partially on the German Gabelsberger system, works according to the frequency of particular sounds in the Swedish language, and uses phonetic symbols to represent vowels, consonants, and consonant combinations, as well as a wide range of abbreviations, prefixes, and suffixes. As is the case with, for example, Gabelsberger as well as Gregg's and Pitman's, stenographers using Melin will deconstruct what they hear, reconstruct it as a sequence of phonetic symbols and shortforms, and finally translate their shorthand notes into typed-up longhand. Lindgren's shorthand is on the one hand representative of Melin in terms of following the system's standard closely, but on the other characterised by a tendency to "spell out" phonemes of shorthand rather than abbreviating them or relying on shortforms that are more advanced for the system. An obscuring factor is how Lindgren's "sloppy" handwriting often affects the proportions of scale and slant, which in Melin shorthand are central in producing distinction and meaning.

Quantitative Overview
The presented dataset consists of 198 digitised pages from eight notepads of the manuscript collection of Astrid Lindgren [35]. The originals, which are part of a collection of 670 notepads, are held at the Astrid Lindgren Archives at the Swedish National Library, and are being made available through the project The Astrid Lindgren Code [5]. As outlined above, the eight notepads of LION contain excerpts from drafts to four of Lindgren's literary works. For ease of referencing, we have assigned the following short titles to these works: On our grove, indicated as "Autobiography-1"; Samuel August from Sevedstorp and Hanna in Hult, indicated as "Parents-1"; Emil of Lönneberga, indicated as "Emil-1" and "Emil-2"; and The Brothers Lionheart, indicated as "Lionheart-[1-6]".
Figure 2 presents an overview of the content distribution in pages, across the eight notepads. In the case of The Brothers Lionheart, portions of the first six chapters, indicated in the figure by their respective number, are contained. Notepads 432 and 434 cover sections of the first two and the first three chapters, respectively, including three transitional pages, where the previous chapter ends and the subsequent one begins within the same page, as indicated by "Lionheart-1|Lionheart-2" and "Lionheart-2|Lionheart-3", respectively. Regarding the content stemming from The Brothers Lionheart, it should also be noted that the dataset contains two versions each of chapters one, two and three.
For the other three works, the notepads only contain short excerpts, spanning a few pages each. In contrast to Lionheart-[1-6], the numbering here does not indicate a chapter relation but is simply used as a counter, and, in the case of Emil, to differentiate between two portions of text written in two different notebooks (i.e. 435 and 437). Lastly, it should be noted that some of the presented notepads originally contain private notes and letters by the author. For privacy reasons, these have been removed and are not shared or considered in this work.
All of the pages have been segmented into handwritten lines (cf. subsection 3.3 Data Preparation), resulting in 2900 separate images that can, for example, be used for handwritten text recognition and document image processing. Several of the lines bear a variety of editorial marks, summarised in Figure 4. Concretely, about 10% of the lines contain at least one word that has been struck through, indicating the author's intention to delete it. Two examples of varying degrees of strikethrough, and their impact on readability, are shown in Figure 5. Another 10% of the lines contain additions, i.e. words written above, and occasionally below, the already written line, indicating corrections or additions. The latter 10% often also contain struck-through portions, where one or more words are replaced by an addition. However, we do not differentiate further in this regard and combine all of these in the general additions category. Examples of differing amounts of additions, and combinations with strikethrough, are shown in Figure 6.
Lastly, as presented in Figure 4, the dataset contains a further 3% of lines, which are indicated as missing. For these, the transliterations are incomplete, for example due to severe obfuscations by strikethrough strokes. These 87 lines are included in the data repository, alongside their partial transliterations. However, we do not consider them further in this work and exclude them from all experiments presented below. The 2197 lines that do not feature any of the characteristics above are denoted as clean. Figure 3 shows two examples of such lines.

Data Preparation
The preparation of the dataset entails the collection of transliterations, the acquisition and line-level segmentation of the archival images, and the combination of the two pieces of information to obtain the final, annotated data.Each of these steps is briefly summarised below.

Transliterations
The transliterations were provided by trained stenographers, via expert crowdsourcing in a peer editing process [6]. It should be noted here that we use the term "transliteration", instead of the commonly used "transcription", as the former describes the representation of one alphabet in another, which is the case when converting stenography to Latin characters. During the transliteration process, the stenographers indicated struck-through words and additions, both of which were converted to a suitable data format afterwards. Besides this, the experts indicated line breaks in the transliterations, corresponding to the line endings in the page images.

Image Acquisition and Segmentation
The page images were digitised by the Swedish National Library, using a copy stand (i.e. top-down) camera setup. All eight notepads follow the same general layout, an example of which is shown in Figure 7. The sheets of toned paper are bound together by a spiral binding. Each page contains ruling, in the form of 15 red, printed lines that are evenly spaced, except for a larger margin towards the top and bottom of the page. These landmarks, together with the distinct edges of the notepad against the digitisation background, were used to perform a preliminary segmentation of the page and its individual lines. The majority of the words were written in lead pencil, with the exception of a few sections where Lindgren used a blue ballpoint pen. Word segmentation proposals were obtained using a combination of thresholding, morphological operations and connected component labelling. The word bounding boxes and their assignment to a segmented line were manually refined and proofread.
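The word proposal step described above (thresholding, a morphological operation, connected component labelling) can be illustrated with a minimal, pure-Python sketch. This is not the pipeline used to build the dataset: the threshold level, the horizontal dilation and its radius, and all function names are assumptions chosen for illustration only.

```python
from collections import deque

def threshold(image, level):
    """Binarise a greyscale image (list of rows): dark pixels become foreground (1)."""
    return [[1 if px < level else 0 for px in row] for row in image]

def dilate_horizontal(mask, radius):
    """Morphological dilation with a 1 x (2 * radius + 1) structuring element,
    so that strokes separated by small horizontal gaps merge into one blob."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            lo, hi = max(0, x - radius), min(width, x + radius + 1)
            out[y][x] = 1 if any(mask[y][lo:hi]) else 0
    return out

def bounding_boxes(mask):
    """Label 4-connected components via breadth-first search and return their
    boxes as (x_min, y_min, x_max, y_max), sorted from left to right."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return sorted(boxes)

# Toy line image: two nearby strokes (one word) and a second word further right.
image = [[255] * 12 for _ in range(3)]
for x in (1, 3, 8, 9):
    image[1][x] = 0
proposals = bounding_boxes(dilate_horizontal(threshold(image, 128), 1))
```

Note that the boxes are computed on the dilated mask and are therefore slightly wider than the ink itself; proposals of this kind would still need the manual refinement mentioned above.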

Alignment
As a final step in the dataset preparation process, line-level annotations were obtained by combining the segmented lines with the transliterated text lines. Due to limited proofreading capacities, we did not perform further manual alignment steps to connect word-level transliterations with their corresponding bounding boxes. However, the bounding box coordinates are included in the data repository. In the case of clean and struck lines, word-level annotations of utilisable confidence may be obtained by sequentially combining the transliterations and bounding boxes within a given line. Due to the challenging reading order of lines containing additions, such an automated approach is expected to fail for this specific line type.
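The sequential combination of transliterations and bounding boxes mentioned above could look roughly like the following sketch. The box format (x_min, y_min, x_max, y_max), the whitespace tokenisation and the function name `align_words` are assumptions; the repository's actual file format is not described here.

```python
def align_words(transliteration, boxes):
    """Pair the words of a line with its bounding boxes in reading order.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples and are sorted
    by their left edge. Returns None when the counts differ, since a sequential
    pairing is then not trustworthy (as expected for lines with additions).
    """
    words = transliteration.split()
    ordered = sorted(boxes, key=lambda box: box[0])
    if len(words) != len(ordered):
        return None
    return list(zip(words, ordered))

# Hypothetical clean line with three words and three unordered boxes.
pairs = align_words("jonatan och jag", [(50, 0, 80, 10), (0, 0, 40, 10), (90, 0, 120, 10)])
```

Returning None on a count mismatch is a deliberate, conservative choice: it flags lines that need manual inspection instead of silently producing shifted annotations.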

Data Splitting Considerations
Besides providing the raw data, consisting of page and line images and corresponding transliterations, we propose a number of data splits that can be used for various deep learning tasks. We designate a portion of the data to be used during training and hyperparameter fine-tuning, using five-fold cross validation. The remainder of the data is set aside as test set for the final evaluation.
Considering the unbalanced distribution of content types (cf. subsection 3.2), the data splits have been arranged in a way that allows the investigation of model performance on the majority content type, The Brothers Lionheart, and the generalisation to the other included literary works by Astrid Lindgren.
Based on these considerations, we designate all lines belonging to chapter four of The Brothers Lionheart (i.e. Lionheart-4) as the in-vocabulary test set, referred to as "Test-LH". All lines belonging to Autobiography-1, Parents-1, Emil-1 and Emil-2 are set aside as out-of-vocabulary test set, "Test-OOV".
The proposed data splits are summarised in Table 1. The lines within each data split can either be considered as a whole, i.e. combining clean, struck and added lines, or in various subset combinations, such as only clean lines. For convenience and reproducibility, corresponding lists for all of these combinations are included in the data repository.
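As a generic illustration of five-fold cross validation, the sketch below partitions a list of line identifiers into five folds and yields one train/validation pair per fold. The shuffled, round-robin assignment and the seed are assumptions made for this example; to reproduce the reported experiments, the split lists shipped with the dataset should be used instead.

```python
import random

def five_fold_splits(line_ids, n_folds=5, seed=0):
    """Yield (train, validation) lists, one pair per fold.

    The identifiers are shuffled once with a fixed seed and dealt out
    round-robin, so every line is used for validation exactly once.
    """
    ids = list(line_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, validation

# Toy example with 20 hypothetical line identifiers.
splits = list(five_fold_splits(range(20)))
```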

Brief Quantitative Analysis of Textual Content
In order to quantitatively summarise the textual contents of the cross-validation and the two test sets (Test-LH, Test-OOV), we perform a brief linguistic analysis of the three sets of documents.
For this, we first remove all stop words, using the list provided by NLTK [36] for Swedish. Afterwards, we calculate the Term Frequency-Inverse Document Frequency (TF-IDF) [37] scores for the remaining words in the three partitions. Each text can then be represented as a vector of document-specific TF-IDF scores. Calculating the pairwise cosine similarity quantifies the similarity between the respective documents. Figure 8 summarises these pairwise scores. As can be seen from the figure, the cross-validation and Test-LH portions share a considerable overlap, whereas Test-OOV is noticeably different from both. This matches the aforementioned content descriptions, with the two former documents stemming from the same corpus, and the latter being a combination of three distinct other corpora.
Taking the list of the ten words with the highest TF-IDF scores per document, presented in Table 2, into consideration, the observations above are further emphasised. The lists for the cross-validation split and Test-LH feature figures ("Jonatan", "Sofia") and place names ("Nangijala", "Körsbärsdalen") that are central to The Brothers Lionheart. In contrast to this, the list for Test-OOV contains a combination of central figures and places from Emil of Lönneberga (Emil, his parents "mamma" and "pappa", and the tool shed "snickerboa"), and the names of Lindgren's parents ("Hanna" and "Samuel August"). The one word that occurs in all three lists, "mej", exemplifies the use of colloquial spellings, mentioned initially. The standard spelling for this word is "mig" (English: me, myself). In this regard, it should be noted that we do not control for alternative spellings in our initial filtering of stop words. Therefore, "mej", which is considered a stop word in its official spelling, still appears in this TF-IDF-based word list.
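The analysis steps above (stop word removal, TF-IDF weighting, pairwise cosine similarity) can be sketched in a few lines of plain Python. The exact TF-IDF variant is not specified in the text, so the smoothed IDF used here is an assumption, as are the whitespace tokenisation, the one-word stop list and the toy documents.

```python
import math
from collections import Counter

def tfidf_vectors(documents, stop_words):
    """Return one {word: tf-idf} dictionary per document.

    Assumed weighting: tf = relative frequency within the document,
    idf = log((1 + n_docs) / (1 + doc_freq)) + 1 (a common smoothed variant).
    """
    tokenised = [[w for w in doc.lower().split() if w not in stop_words]
                 for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            word: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy stand-ins for the three partitions; "och" plays the role of the stop list.
docs = ["jonatan och nangijala", "jonatan och sofia", "emil och snickerboa"]
vecs = tfidf_vectors(docs, stop_words={"och"})
sim_cv_lh = cosine_similarity(vecs[0], vecs[1])
sim_cv_oov = cosine_similarity(vecs[0], vecs[2])
```

In this toy setting the first two documents share a name and therefore score higher than the unrelated third one, mirroring the qualitative pattern described for the cross-validation, Test-LH and Test-OOV partitions.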

Baseline: Handwritten Stenography Recognition
In order to establish a baseline for handwritten stenography recognition on the LION dataset, we train and evaluate a state-of-the-art HTR model on the set of clean lines.

Model: Gated-CNN-BGRU
All of the experiments presented in this paper are performed using a slightly modified version of the Gated-CNN-BGRU architecture, proposed by Neto et al. [13]. This model has been shown to perform well in limited-resource settings of similar extent as the LION dataset. Furthermore, this architecture outperformed other CTC-based approaches in our initial experiments. The architecture consists of two major components, shown in detail in Figure 9, that are trained in an end-to-end fashion. Based on prior experiments, we replace the originally proposed Batch Renormalisation [38] layers with regular Batch Normalisation [39] layers. Furthermore, unlike Neto et al. [13], we employ best path decoding [16], instead of a language model, to obtain the transliterations, in order to focus on the performance of the text recognition approach.
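Best path decoding, as referred to above, takes the most likely symbol in every output frame, collapses consecutive repetitions and then removes the CTC blank. The following sketch shows this procedure on a toy example; the placement of the blank at index 0, the three-symbol alphabet and the frame scores are assumptions for illustration and do not reflect the model's actual output layout.

```python
def best_path_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding.

    frame_scores: one list of per-symbol scores for each time step.
    alphabet: indexable symbols, where alphabet[blank] is the CTC blank.
    """
    # 1) Most likely symbol index per frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2) Collapse repeats, 3) drop blanks.
    decoded = []
    previous = blank
    for index in best:
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)

# Frames whose arg-max path is "a a - a b b"; the blank separates the two a's.
frames = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]
text = best_path_decode(frames, alphabet="-ab")
```

The blank between the second and third frame group is what allows the doubled "a" to survive the collapsing step.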

General Training and Evaluation Protocol
The baseline model is trained for up to 100 epochs, using the AdamW [40] optimiser with a learning rate of 0.001, a batch size of eight and the standard CTC loss. All line images are preprocessed by first converting them to the HSV colour space and obtaining a single-channel image by discarding the hue and saturation channels. This step was performed instead of a regular greyscale conversion as it removes most of the printed red rulings (cf. Figure 7). Afterwards, the remaining value channel is inverted and the contrast is stretched, using the 2nd and 98th intensity percentiles as boundaries.
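The preprocessing described above can be sketched as follows. The HSV value channel of a pixel is the maximum of its R, G and B components, which is why saturated red rulings end up as bright as the paper. Everything else in this sketch is an assumption: the 8-bit input range, the linear-interpolation percentile, the clipping, and the rescaling to the range [0, 1].

```python
def value_channel(rgb_image):
    """HSV value channel: per-pixel maximum of (R, G, B)."""
    return [[max(pixel) for pixel in row] for row in rgb_image]

def percentile(sorted_values, q):
    """Percentile q (0-100) of pre-sorted values, using linear interpolation."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def preprocess(rgb_image, low_q=2, high_q=98):
    """Value channel -> inversion -> percentile-based contrast stretch to [0, 1]."""
    inverted = [[255 - v for v in row] for row in value_channel(rgb_image)]
    flat = sorted(v for row in inverted for v in row)
    low, high = percentile(flat, low_q), percentile(flat, high_q)
    if high <= low:  # flat image: nothing to stretch
        return [[0.0 for _ in row] for row in inverted]
    return [[min(1.0, max(0.0, (v - low) / (high - low))) for v in row]
            for row in inverted]

# One-row toy image: white paper, red ruling, pencil stroke, slightly toned paper.
line = [[(255, 255, 255), (255, 30, 30), (60, 60, 60), (250, 250, 250)]]
result = preprocess(line)
```

In the toy image the red ruling pixel receives the same output as the white paper, while the pencil stroke becomes the brightest value after inversion.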
During training, the validation CTC loss is measured after each epoch. Following an initial warm-up period of ten epochs, early stopping with a patience of ten epochs is applied, based on the validation loss. The model weights of the best-performing validation epoch are preserved for the final evaluation. In order to better ascertain the variability of the model performance, training is repeated from scratch 30 times per fold, yielding a total of 150 sets of weights, and thus results. All trained models are evaluated on the test sets, measuring the Character Error Rate (CER) and Word Error Rate (WER), which are defined as follows:
\[ \mathrm{CER}\ (\mathrm{WER}) = \frac{S + D + I}{N} \]
where S, D and I are the number of character (word) substitutions, deletions and insertions, respectively, that need to be performed to convert a given text to a reference text. The sum of these transformations corresponds to the Levenshtein distance [41]. N indicates the number of characters (words) in the reference string. For both metrics, lower values are better, with 0 being the optimum.
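Since both metrics are the Levenshtein distance between prediction and reference, normalised by the reference length, they can be computed with a short dynamic-programming routine. The sketch below is one straightforward way to do so; the function names and the whitespace-based word tokenisation are assumptions, not the evaluation code used for the reported results.

```python
def levenshtein(hypothesis, reference):
    """Minimum number of substitutions, deletions and insertions (S + D + I)
    needed to turn `hypothesis` into `reference`; works on strings and lists."""
    previous = list(range(len(reference) + 1))
    for i, hyp_item in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_item in enumerate(reference, start=1):
            current.append(min(
                previous[j] + 1,                            # delete hyp_item
                current[j - 1] + 1,                         # insert ref_item
                previous[j - 1] + (hyp_item != ref_item),   # substitute or match
            ))
        previous = current
    return previous[-1]

def error_rate(hypothesis, reference):
    """(S + D + I) / N, where N is the length of the reference."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)

def cer(hypothesis, reference):
    """Character error rate of a predicted line against its reference."""
    return error_rate(hypothesis, reference)

def wer(hypothesis, reference):
    """Word error rate, splitting both strings on whitespace."""
    return error_rate(hypothesis.split(), reference.split())

# One wrong character in a seven-character word, one wrong word out of three.
example_cer = cer("jonaten", "jonatan")
example_wer = wer("jag var dar", "jag var här")
```

Note that both rates can exceed 1 when the prediction is much longer than the reference, because insertions are counted against the reference length.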

Experiment
As mentioned above, we limit our experiments on the new LION dataset to the portion of clean lines. Besides excluding lines with any form of strikethrough or additions during training and validation, we also exclude these lines when reporting the model's performance on the test set. We chose to limit the data for this first investigation, in order to rule out any potential side-effects of the altered lines. Furthermore, most of the currently available datasets either do not contain such forms of alterations (e.g. Saint Gall [42]), or explicitly exclude them, for example by providing placeholder transcriptions, like "#", for struck-through words (e.g. IAM [43]). The recognition performance for struck and added lines is briefly discussed in section 7.

Baseline Results and Analysis
Table 3 shows the mean CER and WER of the baseline experiment. As shown in the table, there is a noticeable difference in performance between Test-LH and Test-OOV, with the model performing considerably worse on the latter portion.
Regarding the overall performance (third row), it can be noted that both error rates are noticeably higher than the rates obtained on commonly-used benchmark datasets of similar size, such as Saint Gall [42], for which, for example, Neto et al. achieve an error rate around 4% [13]. The comparably high error rates illustrate the challenging nature of the LION dataset. Additionally, the reduced performance on the out-of-vocabulary test set (Test-OOV) gives an indication of the difficulty of stenography recognition itself.

Encoding Stenographic Domain Knowledge
The baseline experiment, presented in the previous section, treats stenography recognition as a traditional transcription problem. So far, we have not considered any of the aspects that are inherent to stenography and the Melin system, and that set it apart from other scripts, such as Latin. One major characteristic of this stenographic system is that the symbol set is considerably larger than that of the Swedish alphabet, resulting in a one-to-many mapping of symbols to characters in the Swedish transliterations, and thus also in the CTC decoding step. In order to investigate whether more direct mappings, closer to a one-to-one relationship, can improve the recognition performance, we have selected four groups of such mappings and implement target sequence encodings, inspired by diplomatic transcriptions (cf. e.g. [44]).

Encoding Schemes
Firstly, we consider words that share the same visual representation as a character symbol, for example "och" (and ), being written as the symbol for "o", jag (I ) as "j" and "var" (where, was) as "v".
In total, we have identified 14 such shortforms and their corresponding characters, which are shown in Table 5.The second row in Table 4 shows an example of applying this encoding technique, which we term shortform.Although shortforms may also be used when the word appears as a prefix (e.g.överallt), we limit the encoding to isolated occurrences of the respective words.The decoding of prefix occurrences is ambiguous (e.g.ögon vs ö[ver]allt) and would therefore require additional language knowledge to definitively decode a given string.
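To make the shortform idea concrete, the sketch below encodes isolated occurrences of three of the shortforms named in the text ("och", "jag", "var") and decodes them again. The placeholder characters are hypothetical and are assumed not to occur in any transliteration; the actual symbols and the full list of 14 shortforms are given in Table 5 and are not reproduced here.

```python
# Three of the 14 shortforms; the placeholder characters are hypothetical and
# assumed to be absent from the regular transliteration alphabet.
SHORTFORMS = {"och": "§", "jag": "¤", "var": "¶"}
PLACEHOLDERS = {symbol: word for word, symbol in SHORTFORMS.items()}

def encode_shortforms(line):
    """Replace isolated shortform words with their placeholder symbol.
    Words that merely start with a shortform (e.g. 'varje') stay untouched."""
    return " ".join(SHORTFORMS.get(word, word) for word in line.split(" "))

def decode_shortforms(line):
    """Invert the encoding by expanding every placeholder symbol."""
    return "".join(PLACEHOLDERS.get(char, char) for char in line)

original = "jag var hemma och varje dag"
encoded = encode_shortforms(original)
restored = decode_shortforms(encoded)
```

Restricting the replacement to whole words is what keeps the mapping unambiguous in this sketch: a placeholder can only ever stand for the complete word it replaced.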
For the second encoding scheme, referred to as suffix, we selected four frequently appearing suffixes, "-are", "-ing", "-en" and "-et", and replace each of the occurrences with its own symbol, as demonstrated in the third row of Table 4.The first two prefixes were primarily chosen because they are represented by their own symbols in the Melin standard.In addition to this, the latter two are included because they are often indicated by leaving out the "e" and simply appending an "n", respectively "t" as terminating character.Although this cannot be considered a separate symbol of its own, it does result in a one-tomany character mapping like the other explored encodings.
As a third encoding method, termed n-gram, we investigate the 31 n-grams, shown in Table 6, for which the Melin system defines its own symbols.We emulate this symbol assignment by replacing the respective n-gram occurrences with separate symbols in the target sequences.An example for this is shown in row four of Table 4, where the encoding of the n-gram "nkt" is visualised as "&".
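An n-gram substitution of this kind can be sketched with a single regular expression. The mapping of "nkt" to "&" follows the visualisation mentioned above; the second entry ("ng" to "%") and the longest-match-first ordering are assumptions for illustration, and the full set of 31 n-grams from Table 6 is not reproduced.

```python
import re

# "nkt" -> "&" follows the example in the text; "ng" -> "%" is a hypothetical
# second entry. Placeholders are assumed not to occur in the transliterations.
NGRAMS = {"nkt": "&", "ng": "%"}
SYMBOLS = {symbol: ngram for ngram, symbol in NGRAMS.items()}

# Longer n-grams are tried first so that overlapping candidates resolve
# deterministically.
_PATTERN = re.compile("|".join(sorted(map(re.escape, NGRAMS), key=len, reverse=True)))

def encode_ngrams(text):
    """Replace every n-gram occurrence with its single placeholder symbol."""
    return _PATTERN.sub(lambda match: NGRAMS[match.group(0)], text)

def decode_ngrams(text):
    """Expand placeholder symbols back into the n-grams they represent."""
    return "".join(SYMBOLS.get(char, char) for char in text)

encoded_name = encode_ngrams("nangijala")
encoded_word = encode_ngrams("punkt")
```

Because each placeholder expands to exactly one n-gram, decoding is a plain character-wise lookup, which is the reversibility property the encodings rely on.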
Lastly, we combine a variety of (sub-) words for which the Melin system defines its own symbols, creating a more extensive set of transformations, as compared to the aforementioned three, therefore termed Melin.Concretely, we consider commonly-used shortforms, as well as words that are represented by their own symbols.Besides this, we consider a selection of affixes and n-grams for which the Melin standard defines individual symbols.Table 7 summarises the considered words and n-grams, and the final row in Table 4 demonstrates the application of this encoding.
Overall, it should be noted that these four encoding schemes were chosen and implemented in a manner that makes them fully reversible, i.e. all altered strings can be unambiguously decoded to their original representation by replacing the respective placeholder symbols with the characters they represent. A variety of other encoding schemes are conceivable within the Melin standard; however, these are often not unambiguously decodable, or would require the integration of additional language knowledge into the decoding process.
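The reversibility requirement can be made concrete with a minimal sketch. It assumes hypothetical placeholder symbols from the Unicode private-use area and only a handful of entries (the shortforms av, det, och, jag and the n-grams "nskt", "skt", "nkt" mentioned in the text); the actual symbol assignments and complete lists are those of Tables 5 and 6, and the function names are illustrative rather than taken from the published code.

```python
import re

# Hypothetical placeholder symbols; the real assignments are in Tables 5 and 6.
SHORTFORMS = {"och": "\ue000", "det": "\ue001", "jag": "\ue002", "av": "\ue003"}
NGRAMS = {"nskt": "\ue010", "skt": "\ue011", "nkt": "\ue012"}


def encode_shortforms(text: str) -> str:
    """Replace isolated shortform words only; prefix occurrences stay untouched."""
    # Simplification: words are split on single spaces, punctuation is ignored.
    return " ".join(SHORTFORMS.get(word, word) for word in text.split(" "))


def encode_ngrams(text: str) -> str:
    """Replace n-grams, longest first, so that 'nskt' takes priority over 'skt'."""
    ordered = sorted(NGRAMS, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(ngram) for ngram in ordered))
    return pattern.sub(lambda match: NGRAMS[match.group(0)], text)


def decode(text: str, table: dict) -> str:
    """Invert an encoding by expanding every placeholder symbol."""
    for original, symbol in table.items():
        text = text.replace(symbol, original)
    return text


encoded = encode_shortforms("jag och avtal")  # "avtal" keeps its "av" prefix
restored = decode(encoded, SHORTFORMS)        # identical to the input string
```

Because every placeholder is a single symbol that does not occur in the regular alphabet, decoding is a plain substitution and needs no language knowledge, which is the property the four schemes were selected for.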

Experiments
We use each of the encoding schemes introduced above to create alternative target text representations. These encoded texts are then used to train four versions of the Gated-CNN-BGRU, using the training protocol introduced above (subsection 4.2). We adapt the alphabet size for each of the experiments according to the respective encoding, for example increasing it by 14 symbols for the shortform encoding. Before calculating the CER and WER on the test set, the predicted text lines are decoded back to the regular Swedish alphabet, i.e. the encoding step is inverted to obtain a regular Swedish text representation.
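As a reminder of how the two metrics behave, the following sketch computes CER and WER on already decoded strings. It assumes the common edit-distance-based definitions (Levenshtein distance normalised by the reference length); the exact implementation used for the experiments may differ in details such as whitespace handling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# One wrong character in an 11-character, 3-word line:
line_cer = cer("jag ser det", "jag sar det")  # 1 / 11
line_wer = wer("jag ser det", "jag sar det")  # 1 / 3
```

The example illustrates the difference in scaling between the two metrics: a single character error affects one of eleven characters but one of three words.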

Results and Analysis of Different Encoding Schemes
Table 8 summarises the results for the four encoding methods. Generally, the error rates lie within ±0.3 percentage points (pp) of the baseline performance. A notable exception is the WER obtained by the Melin encoding scheme, which yields a considerable improvement of 3 pp. A potential explanation for this improvement lies in the shortform portion of the encoding scheme (first group in Table 7). An inspection of the (mis-)spellings of these words reveals that roughly 74% of these words are spelt correctly when using the Melin encoding scheme, an increase of approximately 13% over the baseline recognition of the same words.
Considering the overall frequency of words that are transformed by the different encoding schemes, it can be noted that some of the considered symbols only appear in very small numbers, for example a few tens for some of the n-grams, out of the almost 10 000 words. This low number of samples may be a contributing factor to the small impact of the different encoding schemes. In an attempt to mitigate this, we study the recognition performance further, using a larger, synthetic training set in the last set of experiments, discussed in the following section.

Training on Synthetic Data
As outlined above, we expand the training set in order to better study the potential of our proposed encoding schemes. A commonly used approach for increasing the training set size is to use data augmentation techniques. We studied this approach extensively in our prior work and only obtained small, albeit significant, improvements [7]. In the following, we explore an alternative approach, based on the recombination of individual words.

Dataset Creation
In order to create a variety of synthetic line images and corresponding transliterations, we first segment the original training lines, using the bounding boxes provided with the dataset. Word labels are obtained by aligning the segmented word images with the provided transliterations. As mentioned initially, this level of annotation has not been proofread. We therefore discard all lines, a total of 20, for which the number of bounding boxes differs from the number of words in the transliteration. While the pool of remaining words is not guaranteed to be perfectly annotated, most of the alignments can be expected to be correct, due to prior proofreading efforts.
Based on the obtained pool of annotated word images, new text lines are created via random combinations. Line breaks are inserted such that the resulting text lines have a similar character count to the original ones. Before pasting the word images onto a blank canvas, slight transformations are applied, using the previously identified augmentations: rotation, scaling and shifting [7]. Overall, each original word image is used ten times in different word contexts to generate new lines. Regarding the content of the newly generated lines, it should be noted that the words are combined in an unconstrained manner, not taking any linguistic considerations into account.
We apply this generation process to the training sets of the five folds, yielding five larger datasets, each containing around 9 400 line images. The validation and test sets are not altered by the generation process and remain in their original configuration.
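The generation procedure can be sketched at the level of word labels as follows. This is a simplified illustration, not the published implementation: word images and augmentations are omitted, and the names `filter_lines` and `recombine` as well as the `target_chars` value are assumptions introduced for the example.

```python
import random


def filter_lines(lines):
    """Keep only lines whose bounding-box count matches the word count."""
    return [(boxes, text) for boxes, text in lines
            if len(boxes) == len(text.split())]


def recombine(words, copies=10, target_chars=30, seed=0):
    """Shuffle `copies` instances of every word into new lines.

    A line break is inserted whenever adding the next word would exceed
    `target_chars` (a hypothetical stand-in for the original line length).
    """
    rng = random.Random(seed)
    pool = [word for word in words for _ in range(copies)]
    rng.shuffle(pool)  # unconstrained order, no linguistic considerations

    lines, current, length = [], [], 0
    for word in pool:
        added = len(word) + (1 if current else 0)  # +1 for the space
        if current and length + added > target_chars:
            lines.append(current)
            current, length = [], 0
            added = len(word)
        current.append(word)
        length += added
    if current:
        lines.append(current)
    return lines


new_lines = recombine(["jag", "och", "det", "nangijala"])
```

In an image-based version, each entry of a generated line would be a word crop that is slightly rotated, scaled and shifted before being pasted onto the blank canvas.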

Experiments
The training and evaluation follow the same protocol as above, with the exception of replacing the training sets with the newly generated, synthetic ones for each of the folds. Following the initial pre-training stage on the synthetic data, a fine-tuning step is applied, using the genuine, original training set of the respective fold. The fine-tuning stage uses the same parameters as all other experiments, except that the warm-up period is removed, since, unlike in all other experiments, the models are not trained from scratch.
Where applicable, statistical hypothesis testing is performed based on the mean line CER and WER, respectively, using Wilcoxon paired signed-rank tests [45] with a Bonferroni correction [46] to correct for multiplicity.
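The multiplicity correction can be illustrated with a short sketch. The p-values below are invented for illustration only; in practice they would come from the paired signed-rank tests (for instance via `scipy.stats.wilcoxon`, which is an assumption about tooling, not a statement about the code used here), and the significance level is likewise an assumption.

```python
def bonferroni(p_values, alpha=0.01):
    """Bonferroni correction: scale each p-value by the number of tests."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    # A comparison counts as significant if its adjusted p-value is below alpha.
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


# Invented raw p-values for three hypothetical pairwise comparisons.
results = bonferroni([0.001, 0.004, 0.2])
significant = [flag for _, flag in results]
```

With three comparisons, a raw p-value of 0.004 is adjusted to 0.012 and therefore no longer passes a 0.01 threshold, which shows how the correction guards against spurious findings when several schemes are compared.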

Results and Analysis
The results of the combined pre-training and fine-tuning step are summarised in Table 9. Comparing the encoding-free performance with the baseline (Table 3), clear improvements for both metrics can be observed. Similarly, each of the examined encoding schemes outperforms its baseline counterpart (Table 8) by 3.8-5.3 pp for the CER and 6.4-8.3 pp for the WER.
The three encoding schemes shortform, suffix and Melin significantly (p < 0.01) outperform the encoding-free setup when applied in conjunction with pre-training and fine-tuning. Similar to the original encoding experiments, a considerable improvement of 3.3 pp is achieved for the WER by the Melin scheme. In a traditional recognition setup, decreases in the WER are often coupled with considerably larger decreases for the CER, due to a difference in scaling (character vs. word count). However, for the Melin scheme, the CER improvement is more modest than the one for the WER. One explanation for this lies in the considered shortforms, which amount to almost 30% of the words in the test sets. Each of these words is represented by a single symbol. A correctly recognised shortform symbol will therefore result in a correctly recognised word, assuming that the spaces alongside it are also correctly identified. This will positively impact both metrics. At the same time, the incorrect recognition of a regular character as a shortform (or vice versa) will have a much larger adverse effect on the CER than a regular character-character confusion, as illustrated in the example in Table 10. This effect also applies to the shortform encoding; however, as this only entails the 14 character shortforms, which amount to roughly 15% of the test words, the impact is not as drastic as for the Melin scheme.
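The amplification effect can be reproduced with a small constructed example, which is not the one from Table 10. It assumes a hypothetical placeholder symbol for the shortform "över" and counts character edits after decoding.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


OVER = "\ue004"  # hypothetical placeholder symbol for the shortform "över"


def decode(text: str) -> str:
    """Expand the placeholder back to the word it represents."""
    return text.replace(OVER, "över")


reference = "ögon"

# Regular confusion: "ö" recognised as "o" costs one character edit.
char_errors = edit_distance(reference, decode("ogon"))

# Shortform confusion: "ö" recognised as the "över" symbol. A single wrong
# symbol in the encoded output decodes to "övergon", i.e. three edits.
shortform_errors = edit_distance(reference, decode(OVER + "gon"))
```

Both mistakes are a single wrong symbol in the model output, yet the second one triples the number of character errors once the prediction is decoded, while the word is counted as wrong exactly once in either case.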
Lastly, considering the n-gram encoding, the obtained results are worse than, or differ only marginally from, those of the encoding-free version. One conceivable explanation is that the encoding was applied in a naive fashion, replacing any occurrence of a given n-gram with its respective symbol. This may result in substitutions that are not in line with the Melin system, for example when an n-gram arises at the boundary within a compound word, such as "ns" in "sammansättning" (samman + sättning). In order to investigate this further, the integration of language knowledge, possibly on a phonetic level, or more fine-grained stenography annotations, e.g. at symbol level, will be required.
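The compound-boundary issue can be demonstrated with a brief sketch. The placeholder symbol and the hand-supplied morpheme split are assumptions for illustration; obtaining such splits automatically is precisely the kind of additional language knowledge referred to above.

```python
NS = "\ue020"  # hypothetical placeholder symbol for the n-gram "ns"


def encode_naive(word: str) -> str:
    """Replace every occurrence of 'ns', regardless of word structure."""
    return word.replace("ns", NS)


def encode_per_part(parts) -> str:
    """Replace 'ns' only inside the given compound parts, never across them."""
    return "".join(part.replace("ns", NS) for part in parts)


naive = encode_naive("sammansättning")
aware = encode_per_part(["samman", "sättning"])  # split supplied by hand
```

The naive variant inserts the symbol across the boundary between "samman" and "sättning", whereas the boundary-aware variant leaves the word unchanged.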
Overall, three of the four examined encoding schemes have significantly improved the recognition performance, demonstrating the positive impact of carefully integrating stenographic knowledge into the text recognition process. At the same time, the positive impact of pre-training on a larger, synthetic dataset emphasises the challenge of low-resource datasets.

Recognition of Struck and Added Lines
A detailed study of the recognition performance for struck and added lines, as well as the impact of their presence during training, is beyond the scope of this paper. However, to provide a general performance overview of this considerable portion of the dataset (around 20% of lines), Table 11 presents the CER and WER for the struck and added lines of the test set, using the pre-trained and fine-tuned model, without any encoding applied. For convenience, the error rates for the clean lines are repeated in the final row. The recognition performance is considerably lower for both types of lines than for the clean lines. Considering the struck lines, one option for improving the performance is the removal of strikethrough strokes, as for example proposed in [47,48]. In addition to this, including a substantial number of struck lines in the training set may enable the model to recognise struck words regardless of the occlusion.
In the case of added lines, we do not expect the latter approach to work, as a large portion of the recognition issues stem from the structure of the chosen model, which outputs one character per time-step, without any spatial information. Referring back to Figure 6 (top), it can be observed that in the first quarter of the line, two characters need to be recognised per time-step, including the information on which of the transliterated lines a recognised character belongs to. Approaches for recognising paragraphs or whole pages have, for example, been proposed by Yousef and Bishop [15] and Bluche [49], and could be an interesting starting point for future work in this regard.

Conclusion
In this work we have established a baseline for handwritten text recognition for stenography, using the newly introduced LION dataset. We have studied the effect of integrating stenography-based domain knowledge, in the form of selected encoding schemes, derived from visual aspects and rules of the Melin writing system. In addition to this, we have investigated the use of pre-training and fine-tuning, using generated line images, to increase the volume of the otherwise low-resource LION dataset. Based on our experiments and the obtained results, we draw the following conclusions:
1. Automatically transliterating handwritten stenography poses a challenging task to handwritten text recognition.
2. Pre-training on generated line images, followed by a fine-tuning step using genuine data, improves the recognition performance considerably.
3. Combining pre-training and fine-tuning with selected, stenography-based target sequence encoding schemes yields further, significant recognition improvements.
Overall, it can be noted that the transliteration of stenography is not straightforward, as the writing system consists of an extensive set of rules, unlike many other, previously examined scripts. Despite this, we have demonstrated that handwritten stenography recognition is possible. The produced transliterations can, for example, be used as a basis for human proofreading, thus considerably reducing the time and effort required to process a document.
This aspect is especially relevant, as the firsthand, applied knowledge of stenography is steadily decreasing. It is therefore crucial to utilise these skills and experiences while they are still readily available, in order to process as many stenographic documents as possible, ensuring future access to the material.
Unlocking the stenographic notes of Astrid Lindgren, one of Sweden's most renowned authors, is of significance from a cultural heritage perspective. Besides this, however, handwritten stenography recognition is also of relevance for areas such as political history and genealogy, for example in the form of stenographed court records and personal diaries, respectively.
We therefore make the LION dataset publicly available with the aim of encouraging future research in handwritten stenography recognition. Potential avenues for future work are, for example, approaches that generate entirely new, unseen word images using the Melin system, or that integrate phonetic domain knowledge into the recognition process. Besides this, investigating linguistic approaches, for example in the form of a language model, may be of interest.
Data Availability. The LION dataset is available in the following Zenodo repository: submission in progress. Furthermore, the code for the presented experiments can be obtained from this Zenodo repository: https://doi.org/10.5281/zenodo.8249817.

Fig. 2 Distribution (in pages) of the different literary works, across the eight notepads.

Fig. 4 Distribution of lines into the four different categories: clean, struck, added and missing.

Fig. 5 Examples for lines containing varying amounts and styles of strikethrough.

Fig. 6 Examples for lines containing individual additions (top), and in combination with struck-through words (bottom).

Fig. 7 Sample page, demonstrating the original contrast, metal binding at the top, and red printed rulings.

Fig. 8 Cosine similarity between the TF-IDF vectors (excluding stop words) of the respective datasplits.

Fig. 9 Summary of the Gated-CNN-BGRU architecture as used in the presented experiments.

Table 1
Line count per datasplit and line type.

Table 2
Top-10 terms, weighted by TF-IDF score, for each of the datasplits.

Table 3
CER and WER (in %) for the baseline experiment, using the Gated-CNN-BGRU, trained on the original, clean lines.

Table 4
Demonstration of applying the four different encoding schemes on a sample string.

Table 5
List of shortforms and the character by which they are replaced during the shortform encoding step.

Table 6
List of n-grams considered for the n-gram encoding scheme. N-grams are resolved from longest to shortest, to avoid conflicts due to overlaps, e.g. "nskt" vs. "skt".

Table 7
Summary of the Melin encoding scheme. Groups from top to bottom: shortforms, prefixes, suffixes, n-grams.

Table 8
Summary of CER and WER (in %) for the different encoding schemes and the encoding-free baseline, trained and evaluated on the original, clean lines.

Table 9
Summary of CER and WER (in %) for the different pre-trained and fine-tuned encoding schemes and the encoding-free baseline.

Table 11
Summary of CER and WER (in %) for the three line types, struck, added and clean, based on the pre-trained and fine-tuned encoding-free model.