1 Introduction

Historical documents possess an outstanding cultural value. They are a unique public asset, forming the collective and evolving memory of our societies [44]. For this reason, with the aim of converting these documents to digital form, many tasks revolve around the processing of historical documents.

One such task is language modernization. Due to the evolving nature of human language, historical documents are mostly accessible only to scholars. Thus, in order to make these documents available to a broader audience, language modernization aims to automatically generate a new version of a given document written in the modern version of its original language. However, while it succeeds in helping non-experts to understand the content of a historical document, language modernization is not error-free.

Similarly, another task related to the processing of historical documents is spelling normalization. Besides the evolving nature of human language, spelling conventions were not created until recently. Therefore, orthography changes depending on the author and time period, which could lead to an astonishing variety of ways of writing a given word (e.g., Laing [28] pointed out more than 500 different forms recorded for writing the preposition through). These linguistic variations are present in historical documents and have always been a concern for scholars in humanities [6]. Spelling normalization tackles this problem by adapting a document’s spelling to modern standards. However, it is still not able to produce error-free normalizations.

In both cases, scholars need to correct the system’s outputs whenever error-free modernized/normalized versions are needed. With the aim of helping scholars to generate these error-free versions, we propose to deploy the interactive machine translation (IMT) collaborative framework into these tasks. In this methodology, a human and a translation system work together to produce the final translation.

This work builds upon Domingo and Casacuberta [14], which applied the IMT framework to language modernization. Our contributions are as follows:

  • Further study of the integration of prefix-based and segment-based IMT into language modernization.

  • Integration of prefix-based and segment-based IMT into spelling normalization.

2 Related work

While it has been manually applied to the literature for centuries (e.g., the Bible has been adapted and translated for generations in order to preserve and transmit its contents [21]), automatic language modernization is a young research field. A shared task for translating historical text to contemporary language [54] was one of the first related works. However, although they approached language modernization using a set of rules, the task was focused on achieving orthographic consistency in the document’s spelling. Domingo and Casacuberta [10] proposed a neural machine translation (NMT) approach. Sen et al. [47] augmented the training data by extracting pairs of phrases and adding them as new training sentences. Domingo and Casacuberta [13] proposed a method to profit from modern documents to enrich the neural models and conducted a user study. Lastly, Peng et al. [36] proposed a method for generating modernized summaries of historical documents.

Approaches to spelling normalization include an interactive tool that incorporates spell-checking techniques to assist the user in detecting spelling variations [3]; a weighted finite-state transducer combined with a modern lexicon, a phonological transcriber and a set of rules [40]; a combination of a list of historical words, a list of modern words and character-based statistical machine translation (SMT) [46]; and a multitask-learning approach that applies a deep bidirectional long short-term memory (LSTM) network [23] at the character level [7]. Ljubešić et al. [31] applied a token/segment-level character-based SMT approach to normalize historical and user-created words. Korchagina [27] applied rule-based machine translation (RBMT), character-based machine translation (CBMT) and character-based neural machine translation (CBNMT). Domingo and Casacuberta [11] evaluated word-based and character-based MT approaches, finding the character-based ones to be more suitable for this task and SMT systems to outperform NMT systems. Tang et al. [53], however, compared many neural architectures and reported that the NMT models are much better than SMT models in terms of character error rate (CER). Finally, Hämäläinen et al. [22] evaluated SMT, NMT, an edit-distance approach, and a rule-based finite state transducer and advocated for a combination of these approaches to make use of their individual strengths.

The IMT framework was introduced during the TransType project [17] and was further developed during TransType2 [4]. New contributions to this framework include developing new generations of the suffix [56] and profiting from the use of the mouse [45]. Marie et al. [32] introduced a touch-based interaction to iteratively improve translation quality. Lastly, Domingo et al. [15] introduced a segment-based protocol that broke the left-to-right limitation. With the rise of NMT, the interactive framework was deployed into the neural systems [25, 39], adding online learning techniques [38]; and reinforcement and imitation learning [29].

3 Approaches

In this section, we present and describe our different proposals to tackle language modernization and spelling normalization. All approaches rely on machine translation (MT), whose framework approximates a probability distribution using a mathematical model whose parameters are estimated from a collection of parallel data, in order to compute the translation probability (Pr) of the target sentence given a source sentence.

Thus, given a source sentence \(x_{1}^{J}\), MT aims to find the most probable translation \({\hat{y}}_{1}^{\hat{I}}\) [8]:

$$\hat{y}_{1}^{{\hat{I}}} = \mathop {\arg \max }\limits_{{y_{1}^{I} }} \Pr \left( {y_{1}^{I} |x_{1}^{J} } \right)$$
(1)

3.1 Language modernization

We confront language modernization from an MT perspective: The language of the original document would be the source language, and the modernized language would be the target language. With this in mind, we propose two different approaches based on SMT and NMT.

3.1.1 SMT approach

This approach is based on SMT, which uses models that rely on a log-linear combination of different models [34]. For years, this has been the prevailing approach to compute Eq. (1). Among others, it mainly combines phrase-based alignment models, reordering models and language models [58].

Given a parallel corpus in which for each original document its modernized version (parallel at a line level) is also available, this approach tackles language modernization as a conventional translation task: We train an SMT system using the original documents as the source part of the training data and their modernized versions as the target data.

3.1.2 NMT approaches

These approaches rely on NMT, which makes use of neural networks to model Eq. (1). Its most frequent architecture is based on an encoder–decoder, featuring recurrent networks [2, 51], convolutional networks [19] or attention mechanisms [57]. At the encoding stage, the source sentence is projected into a distributed representation. Then, at the decoding stage, the decoder generates the most likely translation, word by word, using a beam search method [51]. The model parameters are typically estimated jointly on large parallel corpora via stochastic gradient descent [43].

Like the SMT approach, these approaches tackle language modernization as a conventional translation task but using NMT instead of SMT. Additionally, since the scarce availability of parallel training data is a frequent problem for historical data [7] and since NMT needs larger quantities of parallel training data than we have available (see Sect. 5.1), we followed Domingo and Casacuberta’s [13] proposal for enriching the neural models with synthetic data: We apply the feature decay algorithm (FDA) [5] to a monolingual corpus in order to filter it and obtain a more relevant subset. Then, we follow a back-translation approach [48] to train an inverse system, using the modernized version of the training dataset as source and the original version as target. Following that, we translate the monolingual data with this system, obtaining a new version of the documents which, together with the original modern documents, makes up the synthetic parallel data. After that, we train an NMT modernization system with the synthetic corpus. Finally, we fine-tune the system by training for a few more steps on the original training data.
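The direction of the synthetic pairs is easy to get wrong, so the following minimal sketch illustrates it. The function names and the word-substitution table are hypothetical stand-ins: in the actual procedure the inverse system is a trained MT system, and the FDA filtering step is omitted here.

```python
from typing import Callable, List, Tuple

def build_synthetic_pairs(
    modern_sentences: List[str],
    inverse_system: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each genuine modern sentence with its back-translated version.

    The machine-generated "historical" text becomes the source side and the
    genuine modern text stays on the target side.
    """
    return [(inverse_system(sentence), sentence) for sentence in modern_sentences]

# Hypothetical inverse (modern -> historical) system: a toy substitution table.
TOY_RULES = {"the": "þe", "you": "thou"}

def toy_inverse_system(sentence: str) -> str:
    return " ".join(TOY_RULES.get(word, word) for word in sentence.split())

pairs = build_synthetic_pairs(["you see the sea"], toy_inverse_system)
```

Keeping the genuine text on the target side means that the modernization system always learns to produce human-written modern language, even though its synthetic input is noisy.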

We made use of two different NMT modernization approaches, whose difference is the architecture of the neural systems:

  • NMT\(_\textrm{LSTM}\): This approach uses a recurrent neural network (RNN) [23] architecture with LSTM cells.

  • NMT\(_\textrm{Transformer}\): This approach uses a transformer [57] architecture.

3.2 Spelling normalization

We tackle spelling normalization similarly to language modernization (see Sect. 3.1). However, since in spelling normalization changes frequently occur at a character level, we followed a CBMT strategy. Due to spelling normalization being a much simpler problem than MT, we decided to use the simplest approach: splitting words into characters and considering each character as a token. Then, we consider the language of the original documents as the source language and its normalized version as the target language.
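A minimal sketch of this character-level tokenization is shown below. The word-boundary symbol `@` is an assumption made for the example (the text does not state which marker, if any, was used); some marker is needed so that the original words can be restored after translation.

```python
# Hypothetical word-boundary symbol; it must not occur in the text itself.
BOUNDARY = "@"

def to_char_tokens(line: str) -> str:
    """Split every word into characters, marking word boundaries with a token."""
    words = line.split()
    return f" {BOUNDARY} ".join(" ".join(word) for word in words)

def from_char_tokens(tokens: str) -> str:
    """Invert to_char_tokens: join the characters back into words."""
    words = tokens.split(f" {BOUNDARY} ")
    return " ".join(word.replace(" ", "") for word in words)

encoded = to_char_tokens("fizo merced")
decoded = from_char_tokens(encoded)
```

After this preprocessing, each character is a token for the MT system, which can then be trained and applied exactly as in the word-level case.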

3.2.1 CBSMT approach

Like in language modernization’s SMT approach (see Sect. 3.1.1), given a parallel dataset of historical documents and their normalized equivalents, this approach tackles spelling normalization as a conventional translation task—considering the document’s language as the source language and its normalized version as the target language. In this case, however, we follow a character-based statistical machine translation (CBSMT) strategy: The document’s words are split into characters and, then, conventional SMT is applied.

3.2.2 CBNMT approaches

These approaches are similar to the language modernization’s NMT approaches, but using a CBNMT strategy to model Eq. (1). Additionally, since CBNMT also needs larger quantities of parallel training data than we have available (see Sect. 5.1), we followed Domingo and Casacuberta’s [12] proposal for enriching the neural models with synthetic data: Given a collection of modern documents from the same language as the original document, we train a CBSMT system using the normalized version of the training dataset as source and the original version as target. We then use this system to translate the modern documents, obtaining a new version of the documents. This new version, together with the original modern documents, forms a synthetic parallel corpus that can be used as additional training data. After that, we combine the synthetic data with the training dataset, replicating the training dataset several times in order to match the size of the synthetic data and avoid over-fitting [9]. Finally, we use the resulting dataset to train the enriched CBNMT system.
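The data-mixing step can be sketched as follows. The replication factor is not given in the text, so rounding the size ratio up to the next integer is an assumption, as is the function name.

```python
import math
from typing import List, Tuple

Pair = Tuple[str, str]

def mix_corpora(train_pairs: List[Pair], synthetic_pairs: List[Pair]) -> List[Pair]:
    """Replicate the genuine training pairs until they roughly match the
    size of the synthetic data, then concatenate both sets."""
    if not train_pairs:
        return list(synthetic_pairs)
    # Assumed rule: smallest whole number of copies that reaches the synthetic size.
    copies = max(1, math.ceil(len(synthetic_pairs) / len(train_pairs)))
    return list(train_pairs) * copies + list(synthetic_pairs)

train = [("vna", "una"), ("fizo", "hizo")]
synthetic = [(f"s{i}", f"t{i}") for i in range(5)]
mixed = mix_corpora(train, synthetic)
```

Balancing both sets keeps the much larger synthetic corpus from dominating training, which is the stated motivation for the replication.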

Like in language modernization, we made use of two different CBNMT normalization approaches, whose difference is the architecture of the neural systems:

  • CBNMT\(_\textrm{LSTM}\): This approach uses a RNN architecture with LSTM cells.

  • CBNMT\(_\textrm{Transformer}\): This approach uses a transformer [57] architecture.

4 Interactive machine translation

In this work, we deploy the IMT framework into language modernization and spelling normalization. This framework proposes a collaborative process in which a human translator works together with an MT system to generate the final translations. Thus, we can adapt it to language modernization and spelling normalization to create a collaborative framework between scholars and the modernization/normalization systems. In this section, we present and describe the two different IMT protocols we made use of: prefix-based and segment-based.

4.1 Prefix-based IMT

The prefix-based protocol proposes an iterative framework in which users correct the leftmost wrong word from a translation hypothesis, and the system generates a new hypothesis taking into account the user’s feedback. Initially, the system proposes a translation hypothesis \(y_{1}^{I}\) of length \(I\). The user then reviews this hypothesis and corrects the leftmost wrong word \(y_i\). With this correction, they are inherently validating all the words that precede the corrected word, forming a validated prefix \({\tilde{y}}_1^{i}\) that includes the corrected word \({\tilde{y}}_i\). The system immediately reacts to this user feedback (\(f={\tilde{y}}_1^{i}\)), generating a suffix \({\hat{y}}_{i+1}^I\) that completes \({\tilde{y}}_1^{i}\) to obtain a new translation of \(x_{1}^{J}\): \({\hat{y}}_1^I={\tilde{y}}_1^{i}\,{\hat{y}}_{i+1}^I\). This process is repeated until the user accepts the system’s complete suggestion.

The suffix generation was formalized by Barrachina et al. [4] as follows:

$$\hat{y}_{{i + 1}}^{I} = \mathop {\arg \max }\limits_{{I,y_{{i + 1}}^{I} }} \Pr \left( {\tilde{y}_{1}^{i} {\mkern 1mu} y_{{i + 1}}^{I} |x_{1}^{J} } \right)$$
(2)

This equation is very similar to Eq. (1): At each iteration, the process consists of a regular search in the translation space, constrained by the prefix \({\tilde{y}}_1^{i}\).

Similarly, Peris et al. [39] formalized the neural equivalent as follows:

$$\begin{aligned} \small p\left( \hat{y_{i'}} \mid {\hat{y}}_1^{i'-1}, x_1^J, f ={\tilde{y}}_1^i;\;\Theta \right) = {\left\{ \begin{array}{ll} \delta (\hat{y_{i'}}, {\tilde{y}}_{i'}), &{} \text{ if } {i' \le i} \\ \bar{\textbf{y}}^\top _{i'} \textbf{p}_{i'} &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
(3)

where \(x_1^J\) is the source sentence; \({\tilde{y}}_1^i\) is the validated prefix together with the corrected word; \(\Theta\) are the model parameters; \(\bar{\textbf{y}}^\top _{i'}\) is the one-hot encoding of the word at position \(i'\); \(\textbf{p}_{i'}\) contains the probability distribution produced by the model at time-step \(i'\); and \(\delta (\cdot , \cdot )\) is the Kronecker delta.

This is equivalent to a forced decoding strategy and can be seen as generating the most probable suffix given a validated prefix, which fits into the statistical framework deployed by Barrachina et al. [4].
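As an illustration of this forced decoding, the following minimal sketch copies the validated prefix and lets a model choose the remaining words. The bigram table, the `toy_next_probs` function and the use of greedy search instead of beam search are assumptions made for the example; the toy model also ignores the source sentence.

```python
from typing import Callable, Dict, List

EOS = "</s>"

def decode_with_prefix(
    next_probs: Callable[[List[str]], Dict[str, float]],
    prefix: List[str],
    max_len: int = 50,
) -> List[str]:
    """Greedy decoding in which the first len(prefix) words are forced."""
    hypothesis: List[str] = []
    for position in range(max_len):
        if position < len(prefix):
            # Kronecker-delta branch of Eq. (3): copy the validated word.
            token = prefix[position]
        else:
            # Model branch: pick the most probable next word.
            probs = next_probs(hypothesis)
            token = max(probs, key=probs.get)
        if token == EOS:
            break
        hypothesis.append(token)
    return hypothesis

# Hypothetical toy model: next-word probabilities given only the last word.
TOY_BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"dog": 0.8, "cat": 0.2},
    "cat": {"sleeps": 0.9, EOS: 0.1},
    "dog": {"barks": 0.9, EOS: 0.1},
    "sleeps": {EOS: 1.0},
    "barks": {EOS: 1.0},
}

def toy_next_probs(history: List[str]) -> Dict[str, float]:
    return TOY_BIGRAMS[history[-1] if history else "<s>"]

unconstrained = decode_with_prefix(toy_next_probs, [])
constrained = decode_with_prefix(toy_next_probs, ["the", "dog"])
```

Because the suffix is generated conditioned on the forced words, a single correction can change every word that follows it.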

4.2 Segment-based IMT

The segment-based protocol extends the human–computer collaboration defined in the previous protocol (see Sect. 4.1). Besides making a word correction, the user is now able to validate segments (sequences of words) and combine consecutive segments to create a larger one.

As in the prefix-based protocol, the process starts with the system suggesting an initial translation. The user, then, reviews it and validates those sequences of words which they consider to be correct. Then, they are able to delete words between validated segments to create a larger segment. After that, they make a word correction.

These three actions constitute the user feedback, which Domingo et al. [15] formalized as: \({\tilde{{\textbf{f}}}}_1^N = {\tilde{{\textbf{f}}}}_1,\dots ,{\tilde{{\textbf{f}}}}_N\); where \({\tilde{{\textbf{f}}}}_1,\dots ,{\tilde{{\textbf{f}}}}_N\) is the sequence of N correct segments validated by the user in an interaction. Each segment is defined as a sequence of one or more target words. Therefore, each user action modifies the feedback differently:

  1. Validating a new segment, inserting a new segment \({\tilde{{\textbf{f}}}}_i\) in \({\tilde{{\textbf{f}}}}_1^N\).

  2. Merging two consecutive segments \({\tilde{{\textbf{f}}}}_i\), \({\tilde{{\textbf{f}}}}_{i+1}\) into a new one.

  3. Introducing a word correction. This is introduced as a new one-word validated segment, \({\tilde{{\textbf{f}}}}_i\), which is inserted in \({\tilde{{\textbf{f}}}}_1^N\).

While the first two actions are optional—an iteration may not have new segments to validate—the last action is mandatory: It triggers the system to react to the user feedback, starting a new iteration of the process.
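The effect of the three actions on the feedback can be sketched with a plain list of segments. The function names and the 0-based positions are assumptions made for the example, not the data structures of the actual software.

```python
from typing import List

Segment = List[str]

def validate_segment(feedback: List[Segment], position: int, words: Segment) -> List[Segment]:
    """Action 1: insert a newly validated segment at the given position."""
    return feedback[:position] + [list(words)] + feedback[position:]

def merge_segments(feedback: List[Segment], i: int) -> List[Segment]:
    """Action 2: merge segment i with segment i + 1 into a single segment."""
    return feedback[:i] + [feedback[i] + feedback[i + 1]] + feedback[i + 2:]

def correct_word(feedback: List[Segment], position: int, word: str) -> List[Segment]:
    """Action 3: a word correction is inserted as a one-word validated segment."""
    return validate_segment(feedback, position, [word])

feedback: List[Segment] = []
feedback = validate_segment(feedback, 0, ["in", "the"])
feedback = validate_segment(feedback, 1, ["of", "god"])
feedback = correct_word(feedback, 1, "name")
merged = merge_segments(feedback, 0)
```

Each function returns a new list instead of modifying its argument, which makes it easy to keep the feedback of earlier iterations for inspection.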

The system reacts to the user’s feedback by generating a sequence of new translation segments \({\widehat{{\textbf{h}}}}_0^{N+1} = {\widehat{{\textbf{h}}}}_0,\dots ,{\widehat{{\textbf{h}}}}_{N+1}\). That means, an \({\widehat{{\textbf{h}}}}_i\) for each pair of validated segments \({\tilde{{\textbf{f}}}}_i, {\tilde{{\textbf{f}}}}_{i+1}\), being \(1\le i \le N\); plus one more at the beginning of the hypothesis, \({\widehat{{\textbf{h}}}}_0\); and another at the end of the hypothesis, \({\widehat{{\textbf{h}}}}_{N+1}\). The new translation of \(x_{1}^{J}\) is obtained by alternating validated and non-validated segments: \({\hat{y}}_1^I={\widehat{{\textbf{h}}}}_0,{\tilde{{\textbf{f}}}}_1,\dots ,{\tilde{{\textbf{f}}}}_N,{\widehat{{\textbf{h}}}}_{N+1}\). The goal is to obtain the best sequence of translation segments, given the user’s feedback and the source sentence:

$$\begin{aligned} {\widehat{{\textbf{h}}}}_0^{N+1} = \mathop {\arg \max }\limits_{{h_{0}^{{N + 1}} }} {\text{Pr}}\left( {\textbf{h}}_0,{\tilde{{\textbf{f}}}}_1,\dots ,{\tilde{{\textbf{f}}}}_N,{\textbf{h}}_{N+1} \mid x_{1}^{J}\right) \end{aligned}$$
(4)

This equation is very similar to Eq. (2). The difference is that, now, the search is performed in the space of possible substrings of the translations of \(x_{1}^{J}\), constrained by the sequence of segments \({\tilde{{\textbf{f}}}}_1,\dots ,{\tilde{{\textbf{f}}}}_N\), instead of being limited to the space of suffixes constrained by \({\tilde{y}}_1^{i}\).

Similarly, Peris et al. [39] formalized the neural equivalent of this protocol as follows:

$$\begin{aligned} p\left( y_{i_n+i'} \mid y_1^{i_n+i'-1}, x_1^J, f_1^N;\;\Theta \right) =\textbf{y}^\top _{i_n + i'} \textbf{p}_{i_n + i'} \end{aligned}$$
(5)

where \(f_1^N = f_1, \dots , f_N\) is the feedback signal and \(f_1, \dots , f_N\) is a sequence of non-overlapping segments validated by the user; each alternative hypothesis y (partially) has the form \(y = \dots , f_n, g_n, f_{n+1},\dots\); \(g_n\) is the non-validated segment; \(1 \le i' \le {\hat{l}}_n\); and \({\hat{l}}_n\) is the size of this non-validated segment, computed as follows:

$$\begin{aligned} \small {\hat{l}}_n = \mathop {\arg \max }\limits_{{0 \le l_{n} \le L}} \frac{1}{l_n + 1} \sum _{i' = i_n + 1}^{i_n + l_n + 1}\log p\left( y_{i'} \mid y_1^{i' - 1}, x_{1}^{J};\;\Theta \right) \end{aligned}$$
(6)
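The length selection of Eq. (6) can be sketched as follows, reading the normalizer as the candidate length plus one. The token log-probabilities are assumed to be given (in practice they come from the NMT model), and the function name is hypothetical.

```python
from typing import List

def best_gap_length(log_probs_by_length: List[List[float]]) -> int:
    """Return the candidate length with the highest length-normalized score.

    log_probs_by_length[l] holds the l + 1 token log-probabilities that are
    summed in Eq. (6) for a non-validated segment of candidate length l.
    """
    def score(length: int) -> float:
        return sum(log_probs_by_length[length]) / (length + 1)

    return max(range(len(log_probs_by_length)), key=score)

# Invented log-probabilities for candidate lengths 0, 1 and 2.
candidates = [
    [-3.0],              # score -3.0
    [-0.5, -0.7],        # score -0.6
    [-0.4, -0.6, -2.0],  # score -1.0
]
chosen = best_gap_length(candidates)
```

Dividing by the number of summed terms prevents the search from always preferring the shortest segment, since every additional token can only lower the unnormalized sum.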

5 Experimental framework

This section presents the details of our experimental session. We start by describing the corpora used for training our models. Then, we present the evaluation metrics used for assessing our proposal. After that, we detail the training procedure of our MT systems. Finally, we describe how we performed the user simulation.

5.1 Corpora

In our experimental session, we made use of the following corpora:

  • Language modernization:

    • Dutch Bible [54]: A collection of different versions of the Dutch Bible. Among others, it contains a version from 1637—which we consider as the original version—and another from 1888—which we consider as the modern version (using nineteenth-century Dutch as if it were modern Dutch).

    • El Quijote [10]: the well-known seventeenth-century Spanish novel by Miguel de Cervantes, and its corresponding twenty-first-century version.

    • OE-ME [47]: contains the original eleventh-century English text The Homilies of the Anglo-Saxon Church and a nineteenth-century version—which we consider as modern English.

  • Spelling normalization:

    • Entremeses y Comedias (footnote 1) [16]: A seventeenth-century Spanish collection of comedies by Miguel de Cervantes. It is composed of 16 plays, 8 of which have a very short length. Each line corresponds to the same line from its original manuscript.

    • Quijote (footnote 2) [16]: The seventeenth-century Spanish two-volume novel by Miguel de Cervantes. Each line corresponds to the same line from its original manuscript.

Each corpus consists of a collection of historical documents and their corresponding versions in which either the language has been modernized or the spelling normalized. Therefore, each document contains two different versions whose content is parallel at a line level: the original document and its modernized/normalized counterpart.

Additionally, to enrich the neural models we made use of the following modern documents: the collection of Dutch books available at the Digitale Bibliotheek voor de Nederlandse letteren (footnote 3), for Dutch; and OpenSubtitles [30], a collection of movie subtitles in different languages, for the rest of them. Table 1 contains the corpora statistics.

Table 1 Corpora statistics.

5.2 Evaluation metrics

We made use of the following well-known metrics in order to assess our proposal:

  • Word Stroke Ratio (WSR) [55] measures the number of words edited by the user, normalized by the number of words in the final translation.

  • Mouse Action Ratio (MAR) [4] measures the number of mouse actions made by the user, normalized by the number of characters in the final translation.
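Both effort metrics reduce to simple ratios, sketched below at corpus level and expressed as percentages. Counting spaces as characters and splitting words on whitespace are assumptions made for the example.

```python
from typing import List

def word_stroke_ratio(word_strokes: List[int], final_translations: List[str]) -> float:
    """WSR (%): words edited by the user over words in the final translations."""
    total_words = sum(len(text.split()) for text in final_translations)
    return 100.0 * sum(word_strokes) / total_words

def mouse_action_ratio(mouse_actions: List[int], final_translations: List[str]) -> float:
    """MAR (%): mouse actions over characters in the final translations."""
    total_chars = sum(len(text) for text in final_translations)
    return 100.0 * sum(mouse_actions) / total_chars

finals = ["in the name of god", "amen"]       # 6 words, 22 characters
wsr = word_stroke_ratio([1, 2], finals)       # 3 word strokes in total
mar = mouse_action_ratio([9, 2], finals)      # 11 mouse actions in total
```

Lower values indicate less human effort for both metrics.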

Additionally, to evaluate the initial quality of the modernization systems, we used the following well-known metrics:

  • BiLingual Evaluation Understudy (BLEU) [35] computes the geometric average of the modified n-gram precision, multiplied by a brevity factor that penalizes short sentences. In order to ensure consistent BLEU scores, we used sacreBLEU [41] for computing this metric.

  • Translation Error Rate (TER) [49] computes the number of word edit operations (insertion, substitution, deletion and swapping), normalized by the number of words in the final translation. It can be seen as a simplification of the user effort of correcting a translation hypothesis in a classical post-editing scenario.

Finally, we applied approximate randomization testing (ART) [42], with 10,000 repetitions and a p-value threshold of 0.05, to determine whether the differences between two systems were statistically significant.
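The idea behind approximate randomization can be sketched as follows for a metric that is the mean of per-sentence scores. This is a simplified illustration with a hypothetical function name: for corpus-level metrics such as BLEU, the metric would have to be recomputed on each shuffled set of outputs instead of averaged.

```python
import random
from typing import List

def approximate_randomization(
    scores_a: List[float],
    scores_b: List[float],
    trials: int = 10_000,
    seed: int = 0,
) -> float:
    """Estimate the p-value of the difference in mean score between two systems."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Randomly swap the two systems' outputs for this sentence.
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)

p_same = approximate_randomization([0.5, 0.7, 0.9], [0.5, 0.7, 0.9], trials=200)
p_diff = approximate_randomization([1.0] * 20, [0.0] * 20, trials=1000)
```

Two systems are considered significantly different when the estimated p-value falls below the chosen threshold.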

5.3 MT systems

We trained SMT and CBSMT systems with Moses [26], following the standard procedure: We estimated a 5-gram language model (smoothed with the improved Kneser-Ney method) using SRILM [50], and optimized the weights of the log-linear model with MERT [33]. SMT systems were used both for the SMT modernization approach and for generating synthetic data to enrich the neural systems (see Sect. 3.1.2).

To build NMT and CBNMT systems, we used NMT-Keras [37]. We used long short-term memory units [20], with all model dimensions set to 512 for the RNN architecture. We trained the system using Adam [24] with a fixed learning rate of 0.0002 and a batch size of 60. We applied label smoothing of 0.1 [52]. At inference time, we used beam search with a beam size of 6. In order to reduce vocabulary, we applied joint byte pair encoding (BPE) [18] to all corpora, using 32,000 merge operations.

For the transformer architecture [57], we used 6 layers, with all dimensions set to 512 except for the hidden feed-forward layer (which was set to 2048); 8 self-attention heads; 2 batches of words in a sequence to run the generator on in parallel; a dropout of 0.1; Adam [24], with a beta2 of 0.998, a learning rate of 2 and Noam learning rate decay with 8000 warm-up steps; label smoothing of 0.1 [52]; beam search with a beam size of 6; and joint BPE applied to all corpora, using 32,000 merge operations.
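A learning rate of 2 is only meaningful together with the Noam decay, which scales it down. The sketch below uses the schedule from the original transformer work [57], multiplied by the base rate; that the toolkit applies the base rate as a multiplicative factor in exactly this way is an assumption.

```python
def noam_learning_rate(
    step: int,
    base_rate: float = 2.0,
    model_dim: int = 512,
    warmup_steps: int = 8000,
) -> float:
    """Effective learning rate at a given training step (step >= 1).

    The rate grows linearly during warm-up and then decays with the inverse
    square root of the step number.
    """
    return base_rate * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = noam_learning_rate(8000)
early = noam_learning_rate(4000)
late = noam_learning_rate(16000)
```

Under these assumptions the effective rate peaks at the end of warm-up at roughly 0.001, which is in the usual range for Adam.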

5.4 User simulation

Due to the time and economic costs of conducting frequent human evaluations during the development stage, we conducted an evaluation with simulated users. The goal of these users was to generate the modernizations/normalizations given by the reference.

5.4.1 Prefix-based simulation

The simulation starts with the system offering an initial hypothesis. Then, the user compares it with the reference, looking for the leftmost wrong word. When they find it, they make a correction, validating a new prefix in the process. The cost associated with this correction is one mouse action and one word stroke. After this, the system reacts to the user’s feedback by generating a new suffix that completes the prefix to form a new modernization/normalization hypothesis. This process is repeated until the hypothesis and the reference are the same.

To conduct this simulation, we used Domingo et al.’s [15] updated version of Barrachina et al.’s [4] software (footnote 4) for the SMT and CBSMT systems, and NMT-Keras’s [37] interactive branch for the NMT and CBNMT systems.
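The simulated user can be sketched as a short loop. Here `toy_system` is a hypothetical stand-in for an IMT engine that completes a validated prefix from a fixed draft, and the explicit end-of-sentence marker is an assumption that lets the loop also handle hypotheses that are too long or too short.

```python
from typing import Callable, List, Tuple

EOS = "</s>"  # end-of-sentence marker, so that length errors are also detected

def simulate_prefix_based(
    system: Callable[[str, List[str]], List[str]],
    source: str,
    reference: List[str],
) -> Tuple[int, int]:
    """Return (word strokes, mouse actions) needed to reach the reference."""
    target = reference + [EOS]
    prefix: List[str] = []
    word_strokes = mouse_actions = 0
    for _ in range(len(target) + 1):
        hypothesis = system(source, prefix) + [EOS]
        if hypothesis == target:
            return word_strokes, mouse_actions
        # Leftmost position where the hypothesis differs from the reference.
        i = next(k for k, (h, r) in enumerate(zip(hypothesis, target)) if h != r)
        word_strokes += 1   # the user types the correct word
        mouse_actions += 1  # and clicks on the position to correct
        if target[i] == EOS:
            # The user ended the sentence; nothing is left to generate.
            return word_strokes, mouse_actions
        prefix = target[:i + 1]
    raise RuntimeError("the system did not respect the validated prefix")

def toy_system(source: str, prefix: List[str]) -> List[str]:
    # Hypothetical engine: ignores the source and resumes a fixed draft.
    draft = "in the beginning god made heaven and earth".split()
    return prefix + draft[len(prefix):]

reference = "in the beginning god created heaven and earth".split()
effort = simulate_prefix_based(toy_system, "", reference)
```

The accumulated counts are what the WSR and MAR metrics normalize by the length of the final translation.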

5.4.2 Segment-based simulation

For the sake of simplicity and without loss of generality, in this simulation we assumed that the user always corrects the leftmost wrong word and that validated word segments must be in the same order as in the reference. This assumption was also made by the original authors [15].

Like the previous simulation, the process starts with the system offering an initial hypothesis. Then, the user validates segments by computing the longest common subsequence [1] between this hypothesis and the reference. This has an associated cost of one action for each one-word segment and two actions for each multi-word segment. After this, the user checks if any pair of consecutive validated segments should be merged into a single larger segment (i.e., they appear consecutively in the reference but are separated by some words in the hypothesis). If so, they merge them, adding one mouse action for each merge in which there was a single word between the segments, or two otherwise. Finally, they correct the leftmost wrong word. Then, the system reacts to this feedback by generating a new hypothesis. This process is repeated until the hypothesis and the reference are the same.

To conduct this simulation, we made use of Domingo et al.’s [15] software (footnote 5) for the SMT and CBSMT systems, and NMT-Keras’s [37] interactive branch for the NMT and CBNMT systems.
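The segment validation step and its cost can be sketched as follows. Grouping matched words into a segment only when they are adjacent in both the hypothesis and the reference is an assumption, as is the arbitrary choice among equally long common subsequences; the merging and correction steps are omitted.

```python
from typing import List, Tuple

def lcs_pairs(hyp: List[str], ref: List[str]) -> List[Tuple[int, int]]:
    """Index pairs (i, j) of one longest common subsequence of hyp and ref."""
    n, m = len(hyp), len(ref)
    # table[i][j] = length of the LCS of hyp[i:] and ref[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if hyp[i] == ref[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if hyp[i] == ref[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def validated_segments(hyp: List[str], ref: List[str]) -> List[List[str]]:
    """Group matched words that are adjacent in both sequences into segments."""
    segments: List[List[str]] = []
    previous = None
    for i, j in lcs_pairs(hyp, ref):
        if previous is not None and (i, j) == (previous[0] + 1, previous[1] + 1):
            segments[-1].append(hyp[i])
        else:
            segments.append([hyp[i]])
        previous = (i, j)
    return segments

def validation_cost(segments: List[List[str]]) -> int:
    """One mouse action per one-word segment, two per multi-word segment."""
    return sum(1 if len(segment) == 1 else 2 for segment in segments)

hyp = "in the beginning god made heaven and earth".split()
ref = "in the beginning god created the heaven and the earth".split()
segments = validated_segments(hyp, ref)
cost = validation_cost(segments)
```

In this example the user validates three segments with five mouse actions before making any word correction.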

Table 2 Experimental results of our language modernization IMT approaches.

6 Results

In this section, we present the results of the evaluation conducted for each task.

6.1 Language modernization

Table 2 presents the results of deploying the IMT framework into language modernization. It showcases the initial quality of each modernization system and compares their performance using the prefix-based or the segment-based framework.

The SMT approach obtained the best results by a large margin. The prefix-based protocol yields a reduction of the human effort of creating error-free modernizations. Additionally, the segment-based protocol obtains an even larger reduction of the typing effort, at the expense of a small increase in the use of the mouse, which is believed to have a smaller impact on the human effort [15].

Regarding the NMT approaches, although all of them successfully reduce the human effort, these reductions are significantly smaller than the ones obtained by the SMT approach. Furthermore, the segment-based protocol does not offer any benefit with respect to the prefix-based one (both protocols have the same typing effort), while it significantly increases the mouse usage. Most likely, this is related to the modernization quality of these systems being lower than that of the SMT system.

Finally, as already mentioned, it is worth noting the quality gap between the SMT and the NMT approaches, especially for the Dutch Bible dataset. While we created synthetic data to enrich the neural models (see Sect. 3.1.2), the scarce availability of historical training data is a known problem [7] that seems to have a bigger impact on the neural models, which tend to need larger quantities of parallel training data. On the other hand, the SMT models need fewer resources and are capable of better exploiting the available data (especially given the particularities of this task).

6.2 Spelling normalization

Table 3 presents the results of deploying the IMT framework into spelling normalization. It presents the initial quality of each normalization system and compares their performance using the prefix-based or the segment-based protocol.

Table 3 Experimental results of our spelling normalization IMT approaches.

In the case of Entremeses y Comedias, the CBSMT approach yielded the best results for both protocols. For Quijote, all approaches had a similar behavior. When comparing protocols we observe that, while in all cases the IMT framework successfully reduced the human effort needed to generate error-free normalizations, both protocols presented a similar typing effort. Most likely, this is due to the higher initial quality of the systems: Since there are fewer errors to correct, the choice of one methodology over the other is not as relevant as when there are more errors. However, the segment-based protocol comes with a small increase in the mouse effort, since it has a more complex user protocol.

Fig. 1 Example in which both protocols successfully reduced the effort of generating error-free modernizations.

7 Qualitative analysis

In this section, we present a more in-depth study of the system’s behaviors in the different tasks.

7.1 Language modernization

Figure 1 showcases an example in which the IMT framework significantly reduces the human effort of generating an error-free modernization of an Old English document. While modernizing the sentence from scratch has a cost of 14 word strokes and one mouse action, and correcting the automatic modernization costs 7 word strokes and 7 mouse actions, the cost is reduced to 6 word strokes and 6 mouse actions using the prefix-based protocol, and to 3 word strokes and 15 mouse actions (which have a smaller impact on the human effort) with the segment-based protocol.

Figure 2 showcases an example in which only the prefix-based approach is able to reduce the human effort. Modernizing this old Spanish sentence from scratch has an associated cost of 21 word strokes and 1 mouse action, while post-editing the automatic modernization would cost 7 word strokes and 7 mouse actions. The prefix-based protocol is able to reduce the effort by 1 word stroke and 1 mouse action. However, the segment-based protocol maintains the typing effort while increasing the mouse effort to 28 mouse actions. This is due to a known weakness in this protocol, in which the system may fail to properly handle the user correction if it consists of out-of-vocabulary words.

Finally, Table 4 reflects the human effort needed to generate error-free modernizations. In all cases, the IMT framework significantly reduces the typing effort compared to generating the modernizations from scratch, at the cost of increasing the mouse effort (footnote 6). However, it is believed that the mouse has a smaller impact on the human effort [15].

Table 4 Statistics of the effort needed to generate the error-free modernizations.
Fig. 2 Example of a case in which only the prefix-based protocol is able to reduce the human effort of generating error-free modernizations.

Regarding the different IMT protocols, we can observe how the total typing effort gets reduced by half with the segment-based protocol, while the mouse effort increases twofold in the cases of El Quijote and OE-ME, and threefold in the case of Dutch Bible. While these results have been obtained under a simulated environment, we believe that the effort reductions obtained by the segment-based protocol are significant enough to consider this protocol the most suitable for this task. Nonetheless, we would like to extend this study in future work by conducting a human evaluation, which would allow us to take into consideration other factors such as the time taken by each approach.

7.2 Spelling normalization

Figures 3 and 4 showcase some examples of generating error-free spelling normalizations using the interactive framework. As reflected in Table 3, all approaches and protocols yielded similar results. Since the systems have a high normalization quality, the orthographic inconsistencies that need to be normalized typically consist of a few characters per sentence, with most sentences already yielding an error-free normalization.

Fig. 3 Example of normalizing the spelling of a sentence from Entremeses y Comedias.

Finally, Table 5 reflects the human effort needed to generate error-free normalizations. Like in the language modernization task, the IMT framework always succeeds in reducing the typing effort. Moreover, in this case the prefix-based protocol is also able to reduce the mouse effort, while the segment-based approach doubles the total number of mouse actions.

Overall, both IMT protocols yielded similar results. While the number of mouse actions in the segment-based protocol is considerably larger than in the prefix-based one, this difference is not statistically significant when normalizing by the number of characters (as reflected by the MAR metric in Table 3). Thus, while the prefix-based protocol seems to perform better on this task, a human evaluation, which would allow us to measure additional factors such as the time taken, needs to be conducted before arriving at a categorical conclusion.

Fig. 4 Example of normalizing the spelling of a sentence from Quijote.

Table 5 Statistics of the effort needed to generate the error-free normalizations.

8 Conclusions and future work

With the aim of helping scholars to generate error-free modernizations/normalizations, in this work we have deployed the interactive framework into two tasks related to the processing of historical documents: language modernization and spelling normalization. We deployed two different protocols to several MT modernization and normalization approaches.

Results show that the IMT framework always succeeded in reducing the human effort. For language modernization, the SMT approach yielded the best results under the segment-based protocol, reducing the typing effort by around two to ten points. In the case of spelling normalization, due to the high quality of the systems, all approaches and protocols behave similarly.

Finally, in future work we would like to conduct a human evaluation with the help of scholars to better assess the benefits of applying the interactive framework to language modernization and spelling normalization.