1 Introduction

Handwriting provides language information based on structured symbols and is used for communication or documentation of speech. HWR refers to the digitization of written text and can be categorized into offline and online HWR. Research on offline HWR systems is very advanced and has almost reached human-level performance, but offline systems cannot be applied to real-time recognition applications, as the written text must first be digitized, which induces an unacceptable delay [23]. Optical character recognition (OCR), one of the dominant approaches in offline HWR, deals only with the analysis of the visual representation of handwriting. Its application and accuracy are limited as it cannot make use of temporal information such as writing direction and speed, or the pressure applied to the paper [65, 69].

In contrast, online HWR typically works on different types of spatio-temporal signals such as the position of the pen tip (in 2D), its temporal context or the movement on the writing surface. These handwriting signals can then, e.g., be partitioned into (indexed) strokes [96]. Compared to offline HWR, OnHWR poses its own challenges, e.g., the difficult segmentation of cursively written sequences into single characters. Many highly relevant handwriting problems in everyday life require both an informative representation of the writing and well-working classification algorithms [35]. Examples include the verification of signatures, the identification of writers, and the recognition of handwriting.

The representation of written text crucially depends on the way it has been recorded. Many recording systems make use of a stylus pen (a touch pen with a sensitive magnetic mesh tip) together with a touch screen surface [2]. Systems for writing on paper are only prototypical, such as the ones used in [11, 79, 100] or the GyroPen [17] that provides pen-like interaction from the standard built-in sensors of modern smartphones. An advanced system for recording online HWR data was proposed by Ott et al. [65], who use a sensor-enhanced ballpoint pen extended with inertial measurement units (IMUs). The hand movement and velocity patterns with such a pen differ from air-writing [108]. In this paper, we propose a novel dataset collection of equations and words recorded with an IMU-enhanced pen. This pen allows an online representation of the accelerations, orientations and the pressure applied to the pen. Writing styles can thereby be characterized by an information-rich multivariate time-series (MTS). These datasets lay the foundation for HWR from pens with integrated sensors [17, 45, 62,63,64,65, 79, 100, 104], a so far unsolved challenge in machine learning.

Table 1 Overview of state-of-the-art trajectory-based and our inertial-based online handwriting datasets

For machine learning tasks derived from online handwriting data, we distinguish between single-label prediction tasks (i.e., classifying characters, digits and symbols) and tasks that predict sequences of labels (i.e., words, sentences and equations). Here, we focus on the online seq2seq prediction task for writer-dependent (WD) and writer-independent (WI) classification, but also consider the single-label classification task. Seq2seq models in natural language processing (NLP) and speech recognition [86] are used to convert sequences of type A to sequences of type B (e.g., sentences from English to German). Many real-world datasets take the form of sequences, e.g., written texts, numbers, audio or video frame sequences. While many approaches build on language models or lexica [9, 71, 81, 86] that outperform model-free approaches for certain datasets (e.g., sentences), these approaches require additional effort to properly deal with the data at hand. They cannot handle dialects and informal words off-the-shelf, do not recognize wrongly written words, and require a large corpus and long training times to achieve an acceptable accuracy [37]. Even with additional pre-processing, language models and lexica cannot (or only with high effort [92]) be applied to certain types of sequences, e.g., equations, as in our case. For our benchmark baselines we therefore resort to language- and lexicon-free approaches without token passing. More specifically, we provide an evaluation benchmark with CNNs combined with (bidirectional) LSTMs and TCNs, and an attention-based model for the seq2seq OnHWR, as well as several transformers for the single character-based classification task.

The remainder of the paper is organized as follows. We discuss related work in Sect. 2. Section 3 presents our novel collection of online handwriting datasets on sequence level. Section 4 introduces the suggested benchmark models; in particular, we propose several CNN architectures. In Sect. 5 we provide experimental results before we end with a conclusion in Sect. 6.

2 Background and related work

We will first provide an overview of available online handwriting datasets and explain the particularities of each one. Next, we discuss related methodological approaches to model such data. For a detailed overview of text classification methods we refer to [47, 49].

2.1 Datasets

While there are many offline datasets, online data are rare [35]. To properly evaluate OnHWR methods, we need a multi-label online dataset that allows for the evaluation of tasks for both the writer-dependent and the writer-independent case. Table 1 gives an overview of available online datasets. For the single character prediction task, the Kuchibue [56, 57], MRG-OHTC [53], CASIA [98] and OnHW-chars [65] datasets are available. While the OnHW-chars dataset is rather small, we provide single character-based datasets from a larger database. For our sequence-based method (i.e., a technique that predicts a whole sequence of characters), the IRONOFF [95], ICROW [78], IAM-OnDB [51], LMCA [42], ADAB [1], IBM-UB [84] and VNOnDB [58, 59] word and sentence datasets can be used.

The commonly used IAM-OnDB [51] and VNOnDB [59] datasets only include online trajectory data written on a tablet. However, writing on an even and smooth surface influences the writing style of the user [28]. To circumvent this disadvantage, we initially recorded a small character-only dataset with a sensor-enhanced pen on usual paper in previous work [65]. In this paper we make use of this novel pen and record sequence-based samples for a comparison and evaluation benchmark with the trajectory-based IAM-OnDB (line level) and VNOnDB-words datasets. Hence, our datasets enable broad research on sequence-based classification from sensor-enhanced pens and connect classical OnHW recognition on tablets with writing on paper using sensor-enhanced pens.

2.2 Methods

While hidden Markov models (HMMs) [3, 8, 18, 19, 22] were initially applied to offline HWR, deep learning models have more recently become predominant, including convolutional neural networks (CNNs) [109], temporal convolutional networks (TCNs) [82, 83] and recurrent neural networks (RNNs) [20, 31, 68, 70, 85, 105] such as long short-term memories (LSTMs), bidirectional LSTMs (BiLSTMs) [12, 91] and multidimensional RNNs [32, 97]. More recently, attention models further improved the classification accuracy of RNNs [10], but did not outperform previous approaches for OnHWR. Although transformers [94] and their variants [13, 36, 38, 44, 90, 101] have become very popular for NLP [75] and image processing, they have so far only been applied to offline HWR [38]. The transformer by [66] is based on a language model and is used for Chinese text recognition. Similarly, variational autoencoders (VAEs), RNNs [29] and generative adversarial networks (GANs) [26] have been successfully applied to synthetic offline handwriting generation, but not yet to the online case. For the time-series classification task, standard convolutional architectures [25, 34, 72, 103, 113], spatio-temporal methods [6, 15, 21, 39, 40] and variants [24, 87, 89, 99] as well as transformers [110] have been employed. In [65], we evaluated classical machine learning techniques, while in this paper we provide a broad evaluation benchmark on the classical and novel time-series classification methods mentioned above. While many approaches predict one class after the other, [14, 54] predict sequences, similar to our approach; this requires a suitable loss function, described in the following. See Appendix 1 for a more detailed overview of related work.

Fig. 1: Exemplary sensor data for the x-, y- and z-axis of the equation: 20583:70

Loss functions For sequence prediction, the connectionist temporal classification (CTC) loss [30, 31, 43] combined with beam search [77] has been used extensively. The Edit distance (ED) [48] quantifies how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other; it allows deletions, insertions and substitutions. However, the ED is a discrete function that is known to be hard to optimize. Ofitserov et al. [60] proposed a soft ED, a smooth and differentiable approximation of the ED. Seni et al. [80] used the ED for HWR. We use the CTC loss for sequence prediction (see Sect. 4).

3 Datasets and evaluation methodology

Fig. 2: Sensor-enhanced pen

Table 2 Overview of our recordings from right-handed writers and state-of-the-art online handwriting datasets for writer-dependent (WD) and writer-independent (WI) tasks

Our datasets are a collection of existing and newly generated online handwriting recordings. Section 3.1 first describes our recording setup to create novel and information-rich datasets. Section 3.2 gives details about the properties of our different OnHW datasets and compares them to existing datasets. Section 3.3 proposes a set of evaluation metrics.

3.1 Recording setup

Our datasets are recorded with a sensor-enhanced pen developed by STABILO International GmbH that contains two accelerometers at the front and the back (3 axes each), one gyroscope (3 axes), one magnetometer (3 axes) and one force sensor, sampled at 100 Hz (see Fig. 2). The data recordings contain 14 measurements provided by the sensors: four sensor signals (each in x, y and z direction), the force with which the pen tip touches the surface, and the timestep at which the tablet receives the data from the pen. Figure 1 shows an exemplary sensor signal from a written equation. Using the force sensor, the data allow strokes to be separated well, as the writer lifts the pen between every character (this is not possible for cursive writing, e.g., for words). In total, 131 adult writers participated in our data collection. For more information on the sensor pen and data acquisition, see Appendices 2 and 3.
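To make the channel layout concrete, the following sketch lists one possible ordering of the 14 recorded measurements; the channel names and their order are illustrative assumptions, not the documented file format.

```python
# Illustrative layout of one recording (T timesteps x 14 channels).
# Channel names and ordering are assumptions; consult the dataset
# documentation for the actual column order.
CHANNELS = [
    "acc_front_x", "acc_front_y", "acc_front_z",  # front accelerometer
    "acc_rear_x", "acc_rear_y", "acc_rear_z",     # rear accelerometer
    "gyro_x", "gyro_y", "gyro_z",                 # gyroscope
    "mag_x", "mag_y", "mag_z",                    # magnetometer
    "force",                                      # pen tip force
    "timestep",                                   # tablet receive time
]
assert len(CHANNELS) == 14
```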

3.2 Datasets

Table 3 Overview of our datasets from left-handed writers for writer-dependent (WD) and writer-independent (WI) tasks
Fig. 3: Exemplary online samples of our OnHW-wordsTraj dataset (left: tablet, middle: camera) and the IAM-OnDB [51] (line level) dataset (right)

We propose a large set of four different sequence-based datasets (see the first four entries in Table 2): the OnHW-equations dataset was part of the UbiComp 2021 challenge (Footnote 1), is written by 55 writers and consists of 10 number classes and 5 operator classes (+, -, \(\cdot \), :, =), with a total of 10,713 samples. While the OnHW-words500 dataset contains only the same 500 words per writer, in the OnHW-wordsRandom dataset every sample is randomly chosen from a large German and English word list. This allows comparing indirect learning of a 500-word lexicon with completely lexicon-free learning. The OnHW-wordsRandom dataset (14,641 samples) is smaller than the OnHW-words500 dataset (25,218 samples), but contains longer words with a maximal length of 27 labels (19 labels for OnHW-words500). The train/validation split for the OnHW-words500 dataset is based on words for the WD task such that the same 400 words per writer are in the train set and the same 100 words per writer are in the validation set. For the WI task, the split is done by writer such that all 500 words of a single writer are either in the train or validation set. As models tend to overfit to the repeated words, the WD task of OnHW-words500 is more challenging than that of the OnHW-wordsRandom dataset. The OnHW-words500R dataset is a random split of OnHW-words500.

Additionally, we record the OnHW-wordsTraj dataset that consists of four different data sources. We replace the ink refill with a Wacom EMR module and record online trajectories at 30 Hz on a Samsung Galaxy Tab S4 tablet along with the sensor data. Four cameras pointed at the pen record the movement of the pen tip at 60 Hz. We manually label the pixels of 100 random images of the recorded videos with the classes "pen", "pen tip" and "background" and train a U-Net [76] to predict the pen tip pixels in all images. From this we derive the pen tip trajectory in camera coordinates. Two persons wrote 4257 words in total, which results in 16,752 camera samples. With this dataset it is possible to compare results from traditional online trajectory datasets (written on a tablet) with our online sensor pen datasets. Figure 3 exemplarily compares the trajectory and camera data of the OnHW-wordsTraj dataset with the IAM-OnDB [51] dataset. Table 3 gives a dataset overview of left-handed writers. Sample sizes are smaller and range between around 3% and 13.4% of the sample sizes of the right-handed datasets. For our benchmark, we consider right- and left-handed writers separately and will publish the right- as well as left-handed datasets for future research (Footnote 2).

Fig. 4: Properties of our and state-of-the-art datasets

Figure 4 compares statistical properties, i.e., the number of samples, sample lengths and character distributions, between our datasets and the state-of-the-art datasets. The IAM-OnDB (line level) and VNOnDB-words datasets consist of more samples and a larger total number of characters than our OnHW datasets, but at the same time use a higher number of classes (81 and 147). The IAM-OnDB samples are longer (up to 64 characters), and the VNOnDB samples are shorter (up to 11 characters) (see Fig. 4a). The character classes of the VNOnDB dataset are more equally distributed than those of the other words datasets (see Fig. 4c), while numbers appear more often than operators in our OnHW-equations dataset (see Fig. 4b). See Appendices A.4 and A.5 for more details on our datasets.

Fig. 5: Distribution of samples for the OnHW-chars, OnHW-symbols and split OnHW-equations datasets

Datasets for single character classification For the OnHW-equations dataset, it is possible to split the sensor sequence based on the force sensor, as the pen is lifted between every single character. This approach provides another useful dataset for a single character classification task. We set split constraints for long tip lifts and recursively split these sequences by assigning a possible number of strokes per character (see the sketch below). This results in a total of 39,643 single characters. Furthermore, we recorded the OnHW-symbols dataset with the same labels (numbers 0 to 9 and operators +, -, \(\cdot \), :, =), written by 27 writers with a total of 2326 single characters. Figure 5 compares the distribution of sample numbers for the OnHW-chars [65] (characters) and OnHW-symbols as well as the split OnHW-equations (numbers, symbols) datasets. While the samples are equally distributed for small and capital characters (\(\approx \) 600 per character), the numbers and symbols are unevenly distributed for the split OnHW-equations dataset (similar to Fig. 4b).
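A minimal sketch of such force-based splitting is given below, assuming a 1D force signal per sample; the threshold and the minimum lift duration are illustrative values, not the split constraints used to build the dataset.

```python
import numpy as np

def split_on_pen_lifts(force, threshold=0.05, min_lift=5):
    """Segment a recording at pen lifts (illustrative parameters).

    force: 1D array of force-sensor readings.
    threshold: force below this value counts as 'pen lifted' (assumed).
    min_lift: number of consecutive low-force timesteps that separates
              two characters (assumed).
    Returns a list of (start, end) index pairs, one per segment.
    """
    down = force > threshold            # True while the pen touches paper
    segments, start, lift_run = [], None, 0
    for t, d in enumerate(down):
        if d:
            if start is None:
                start = t
            lift_run = 0
        elif start is not None:
            lift_run += 1
            if lift_run >= min_lift:    # lift long enough: close segment
                segments.append((start, t - lift_run + 1))
                start, lift_run = None, 0
    if start is not None:
        segments.append((start, len(force)))
    return segments
```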

3.3 Evaluation metrics

We define a set of task-specific seq2seq and single character-based evaluation metrics that are commonly used in the community. Metrics for seq2seq evaluation are the character error rate (CER) and word error rate (WER) that are based on the Edit distance (ED). The ED is the minimum number of substitutions S, insertions I and deletions D required to change the sequences \(\mathbf {f} = (f_1, \ldots , f_r)\) into \(\mathbf {g} = (g_1, \ldots , g_n)\) with lengths r and n, respectively. The ED is defined by

$$\begin{aligned} \mathrm{ED}_{i,j} = {\left\{ \begin{array}{ll} \mathrm{ED}_{i-1,j-1} &{} \text {for } f_i = g_j \\ \min {\left\{ \begin{array}{l} \mathrm{ED}_{i-1,j} + D(f_i) \\ \mathrm{ED}_{i,j-1} + I(g_j) \\ \mathrm{ED}_{i-1,j-1} + S(f_i, g_j) \end{array}\right. } &{} \text {for } f_i \ne g_j \end{array}\right. } \end{aligned}$$
(1)

for \(1 \le i \le r, 1 \le j \le n\), \(\mathrm{ED}_{i,0} = \sum _{k=1}^i D(f_k)\) for \(1 \le i \le r\), and \(\mathrm{ED}_{0,j} = \sum _{k=1}^j I(g_k)\) for \(1 \le j \le n\) [16]. We define the CER \(=\frac{S_c + I_c + D_c}{N_c}\) as the ED (the sum of character substitutions \(S_c\), insertions \(I_c\) and deletions \(D_c\)) divided by the total number of characters in the set \(N_c\). Similarly, the WER \(=\frac{S_w + I_w + D_w}{N_w}\) is computed with word operations \(S_w\), \(I_w\) and \(D_w\) and the number of words in the set \(N_w\) [38]. For single character evaluation, we use the character recognition rate (CRR), i.e., the number of correctly classified characters divided by the total number of characters in the test set.
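These metrics follow directly from Eq. (1) by dynamic programming with unit operation costs; a minimal sketch is given below, where the WER is obtained by applying the same routine to word tokens instead of characters.

```python
def edit_distance(f, g):
    """Minimum number of substitutions, insertions, deletions (Eq. 1)."""
    r, n = len(f), len(g)
    ed = [[0] * (n + 1) for _ in range(r + 1)]
    for i in range(1, r + 1):
        ed[i][0] = i                      # ED_{i,0}: i deletions
    for j in range(1, n + 1):
        ed[0][j] = j                      # ED_{0,j}: j insertions
    for i in range(1, r + 1):
        for j in range(1, n + 1):
            if f[i - 1] == g[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]
            else:
                ed[i][j] = 1 + min(ed[i - 1][j],      # deletion
                                   ed[i][j - 1],      # insertion
                                   ed[i - 1][j - 1])  # substitution
    return ed[r][n]

def cer(predictions, references):
    """Character error rate: summed ED over total reference characters."""
    total_ed = sum(edit_distance(p, ref)
                   for p, ref in zip(predictions, references))
    return total_ed / sum(len(ref) for ref in references)

# WER analogously on word tokens: edit_distance(p.split(), ref.split())
```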

4 Benchmark methods

This section formally defines the seq2seq classification task and our loss functions. We propose our architecture for HWR from IMU-enhanced pens and describe our data augmentation techniques.

Sequence-based classification task An MTS \(\mathbf {U} = \{\mathbf {u}_1, \ldots , \mathbf {u}_m\} \in \mathbb {R}^{m \times l}\) is an ordered sequence of \(l \in \mathbb {N}\) streams with \(\mathbf {u}_i = (u_{i,1},\ldots , u_{i,l}), i \in \{1,\ldots ,m\}\), where m is the length of the time-series that is variable and l is the number of dimensions. Each MTS is associated with \(\mathbf {v}\), a sequence of L class labels from a pre-defined label set \(\Omega \) with K classes. For our classification task, \(\mathbf {v} \in \Omega ^L\) describes words and equations. The training set is a subset of the array \({\mathcal {U}} = \{\mathbf {U}_1, \ldots ,\mathbf {U}_n\} \in \mathbb {R}^{n \times m \times l}\), where n is the number of time-series, and the corresponding labels \({\mathcal {V}} = \{\mathbf {v}_1,\ldots , \mathbf {v}_n\} \in \Omega ^{n \times L}\). The aim of the MTS classification task is to predict an unknown class label for a given MTS. We train the classifier using the loss \({\mathcal {L}}_\mathrm{CTC}({\mathcal {U}}, {\mathcal {V}})\) [30].

Character-based classification task In contrast to the sequence-based classification task, corresponding labels \({\mathcal {V}}\) for the character-based classification task are of length \(L=1\). We define \(p(i|\mathbf {u})\) to be the predicted probability for the ith class and \(q(i|\mathbf {u})\) to be the true class distribution. We train the classifier using the cross-entropy loss and variants against overconfidence and class imbalance [50, 67, 73, 88, 102, 112].

Sequence-based loss The CTC loss avoids the need to pre-segment the training samples. The key idea of CTC is to transform the network outputs into a conditional probability distribution over label sequences. An intermediate label representation allows repetitions of labels and occurrences of a blank label to indicate no output. Hence, a network with the CTC loss has a softmax output layer with one more unit than there are labels. These outputs define the probabilities of all possible ways to align all label sequences with the input sequence [30].
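A minimal PyTorch sketch of this setup follows; the alphabet size, sequence lengths and batch size are placeholders. The blank label occupies the one extra output unit (index 0 here).

```python
import torch
import torch.nn as nn

K = 15                                  # e.g. 10 digits + 5 operators
ctc = nn.CTCLoss(blank=0)               # one extra unit for the blank

T, N, S = 100, 8, 12                    # input length, batch, label length
log_probs = torch.randn(T, N, K + 1).log_softmax(2)  # network outputs
targets = torch.randint(1, K + 1, (N, S))            # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                         # marginalizes over all alignments
```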

Fig. 6: Overview of our CNN architecture in combination with a (Bi)LSTM or a TCN

Fig. 7: Overview of our architecture with multi-head attention

Character-based losses We use the categorical cross-entropy (CCE) loss defined by

$$\begin{aligned} \small {\mathcal {L}}_\mathrm{CCE}({\mathcal {U}}, {\mathcal {V}}) = - \frac{1}{K}\sum _{i=1}^{K} q(i|\mathbf {u}) \log p(i|\mathbf {u}) \end{aligned}$$
(2)

for model training. Samples with softmax outputs that are less congruent with the provided labels are implicitly weighted more than confident sample predictions, i.e., the CCE puts more emphasis on difficult samples, which can cause overfitting to noisy labels [102, 112]. To account for this, we modify the CCE loss such that it down-weights the loss assigned to well-classified examples. We use the Focal loss (FL) [50] defined by

$$\begin{aligned} {\mathcal {L}}_\mathrm{FL}({\mathcal {U}}, {\mathcal {V}}, \alpha , \gamma ) = - \frac{1}{K}\sum _{i=1}^{K} \alpha _{t} \big (1 - p(i|\mathbf {u})\big )^{\gamma } q(i|\mathbf {u}) \log p(i|\mathbf {u}), \end{aligned}$$
(3)

with class balance factor \(\alpha \in [0,1]\), and the modulating factor \(\big (1 - p(i|\mathbf {u})\big )^{\gamma }\) with focusing parameter \(\gamma \ge 0\). As an alternative, we apply label smoothing regularization (LSR) [67] that prevents overconfidence by applying a confidence penalty through a regularization term, yielding

$$\begin{aligned} {\mathcal {L}}_\mathrm{LSR}({\mathcal {U}}, {\mathcal {V}}, \beta ) &= -\frac{1}{K}\sum _{i=1}^{K} \log p(i|\mathbf {u}) - \beta H\big (p(i|\mathbf {u})\big ) \\ &= -\frac{1}{K}\sum _{i=1}^{K} \log p(i|\mathbf {u}) - D_{KL}\big (x||p(i|\mathbf {u})\big ), \end{aligned}$$
(4)

with \(\beta \) controlling the strength of the confidence penalty. Label smoothing is equivalent to an additional Kullback-Leibler (KL) divergence term between a uniformly distributed random variable x and the network's predicted distribution p. The bootstrapping approach [73], applied per mini-batch, is another alternative. The soft bootstrapping loss (SBS) is

$$\begin{aligned} {\mathcal {L}}_\mathrm{SBS}({\mathcal {U}}, {\mathcal {V}}, \beta ) = - \frac{1}{K}\sum _{i=1}^{K} \big [\beta q(i|\mathbf {u}) + (1-\beta ) p(i|\mathbf {u})\big ] \log p(i|\mathbf {u}), \end{aligned}$$
(5)

for predicted class probabilities p with weighting parameter \(\beta \), while the hard bootstrapping loss (HBS)

$$\begin{aligned} \small {\mathcal {L}}_\mathrm{HBS}({\mathcal {U}}, {\mathcal {V}}, \beta ) = - \frac{1}{K}\sum _{i=1}^{K} \big [\beta q(i|\mathbf {u}) + (1-\beta ) z_i\big ] \log p(i|\mathbf {u}) \end{aligned}$$
(6)

uses the maximum a posteriori (MAP) estimate of p given \(\mathbf {u}\), with \(z_i := \mathbb {1}[i = \arg \max q_l, l=1, \ldots , K]\). MAP treats every sample equally, which yields a higher robustness against noisy labels, but can lead to longer training times until convergence and can make optimization more difficult [112]. The generalized cross-entropy (GCE) [112] loss

$$\begin{aligned} \small {\mathcal {L}}_\mathrm{GCE}({\mathcal {U}}, {\mathcal {V}}, \alpha ) = - \frac{1}{K}\sum _{i=1}^{K} \frac{1-p(i|\mathbf {u})^\alpha }{\alpha } \end{aligned}$$
(7)

with \(\alpha \in (0, 1]\) uses a negative Box-Cox transformation to combine the benefits of the MAP estimate and the CCE. The symmetric cross-entropy (SCE) [102] is

$$\begin{aligned} \small {\mathcal {L}}_\mathrm{SCE}({\mathcal {U}}, {\mathcal {V}}, \alpha , \beta ) = \alpha {\mathcal {L}}_\mathrm{CCE}({\mathcal {U}}, {\mathcal {V}}) + \beta {\mathcal {L}}_\mathrm{RCE}({\mathcal {U}}, {\mathcal {V}}) \end{aligned}$$
(8)

based on the reverse cross-entropy (RCE) loss

$$\begin{aligned} \small {\mathcal {L}}_\mathrm{RCE}({\mathcal {U}}, {\mathcal {V}}) = - \frac{1}{K}\sum _{i=1}^{K} p(i|\mathbf {u}) \log q(i|\mathbf {u}), \end{aligned}$$
(9)

aims for more effective and robust learning, where \(\alpha \) mitigates the overfitting of the CCE and \(\beta \) allows for flexible exploration of the RCE. Furthermore, we make use of joint optimization (JO) [88], which overcomes the noisy label problem by learning network parameters and labels jointly. The loss is defined by

$$\begin{aligned} \small {\mathcal {L}}_\mathrm{JO}(\Theta , {\mathcal {V}}|{\mathcal {U}}, \alpha , \beta ) = {\mathcal {L}}_\mathrm{CCE}(\Theta , {\mathcal {V}}|{\mathcal {U}}) + \alpha {\mathcal {L}}_{p}(\Theta |{\mathcal {U}}) + \beta {\mathcal {L}}_{e}(\Theta |{\mathcal {U}}) \end{aligned}$$
(10)

with regularization losses \({\mathcal {L}}_{p}\) and \({\mathcal {L}}_{e}\), and network parameters \(\Theta \).
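To make the loss variants concrete, the following sketch implements the focal loss (Eq. 3) and the soft bootstrapping loss (Eq. 5) on one-hot targets; it is a simplified stand-in that averages over the batch rather than normalizing by K as in the equations above, and the hyperparameter defaults echo the values reported in Sect. 5.2.

```python
import torch

def focal_loss(p, q, alpha=0.75, gamma=8.0):
    """Focal loss, cf. Eq. (3): p are predicted probabilities (batch, K),
    q are one-hot targets (batch, K)."""
    eps = 1e-8
    weight = alpha * (1.0 - p).pow(gamma)   # down-weights easy samples
    return -(weight * q * (p + eps).log()).sum(dim=1).mean()

def soft_bootstrapping_loss(p, q, beta=0.95):
    """Soft bootstrapping, cf. Eq. (5): blends the given labels q with
    the model's own predictions p."""
    eps = 1e-8
    target = beta * q + (1.0 - beta) * p
    return -(target * (p + eps).log()).sum(dim=1).mean()

# Usage (illustrative): p = logits.softmax(dim=1),
# q = torch.nn.functional.one_hot(labels, K).float()
```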

Fig. 8: Data augmentation methods of the original sensor sample (black)

Architectures We propose two different architectures for seq2seq sensor signal classification. For the first method (see Fig. 6), a convolution block consisting of 1D convolutions (200 filters, kernel size 4), max pooling (pool size 2), batch normalization and dropout (rate 0.2) layers is used. One TCN layer of 100 units, one LSTM layer of 100 units or two BiLSTM layers, each with 60 units, follow to extract the temporal context [74]. While we use tanh activations for the BiLSTM layers, we choose ReLU for the TCN and LSTM layers. A dense layer with 100 units and the CTC loss predicts a sequence of class labels (a condensed sketch is given below). Second, we implement an attention-based network (see Fig. 7) that consists of an encoder with batch normalization, 1D convolutional and (Bi)LSTM layers. These map the input sequence \(\mathbf {U} \in \mathbb {R}^{m \times l}\) to a sequence of continuous representations \(\mathbf {z}\). A transformer transforms \(\mathbf {z}\) using \(n_{\text {head}}\) stacked multi-head self-attention blocks \(\text {MultiHead}(Q,K,V) = \text {Concat}(\text {head}_{1}, \ldots , \text {head}_{h}) W^{O}\) with \(W^{O} \in \mathbb {R}^{hd_{v} \times d_\text {model}}\) and \(\text {head}_i = \text {Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})\), where \(W_{i}^{Q}, W_{i}^{K} \in \mathbb {R}^{d_{\text {model}} \times d_k}\) and \(W_{i}^{V} \in \mathbb {R}^{d_{\text {model}} \times d_v}\). The attention block consists of point-wise, fully connected time-distributed layers followed by a scaled dot product layer and layer normalization [5] with output dimension \(d_{\text {model}}\) [94]. The attention can be described as mapping a set of key-value pairs of dimension \(d_v\) and a query of dimension \(d_k\) to an output, computed as \(\text {Attention}(Q,K,V) = \text {softmax}\Big (\frac{Q K^{T}}{\sqrt{d_k}}\Big )V\), where the matrices Q, K and V are sets of queries, keys and values.
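A condensed PyTorch sketch of the first architecture (Fig. 6) follows; the filter count, kernel size, dropout rate and BiLSTM sizes are taken from the text, while the padding, the number of convolution blocks, the input channel count and the output head (mapped directly to the class labels plus a CTC blank) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """CNN + BiLSTM backbone for the CTC-based seq2seq task (sketch)."""

    def __init__(self, in_channels=13, num_classes=15):
        super().__init__()
        self.conv = nn.Sequential(                 # one convolution block
            nn.Conv1d(in_channels, 200, kernel_size=4, padding=2),
            nn.MaxPool1d(2),
            nn.BatchNorm1d(200),
            nn.Dropout(0.2),
        )
        # two BiLSTM layers with 60 units each (tanh is the LSTM default)
        self.rnn = nn.LSTM(200, 60, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 60, num_classes + 1)  # +1: CTC blank

    def forward(self, x):                 # x: (batch, channels, timesteps)
        z = self.conv(x).transpose(1, 2)  # -> (batch, timesteps', 200)
        z, _ = self.rnn(z)
        return self.head(z).log_softmax(2)  # feed into nn.CTCLoss
```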

Data augmentation As the size of the datasets is limited, data augmentation is a critical pre-processing step for networks to prevent overfitting and improve generalization. However, it is not obvious how to carry out label-preserving augmentation in some domains, e.g., scaling of acceleration signals [93]. We apply the following data augmentation methods for wearable sensor data on each sensor channel with 50% probability (see the sketch below). Time warping perturbs the temporal location by smoothly distorting the time intervals between samples, which, e.g., simulates different sampling rates caused by time shifts in the connection between device and tablet. Scaling changes the magnitude of the data in a window by multiplying by a random scalar \(\sigma = \pm 0.1\), which augments drifts in the signals. Shifting adds a random value \(\alpha = \pm 200\) to the force data and \(\alpha = \pm 20\) to the other sensor data. While jittering simulates additive sensor noise by adding a multiple \(\sigma = \pm 0.1\) of the standard deviation to all sensor channels, magnitude warping changes the magnitude by convolving the data window with a smooth curve varying around [0.7, 1.3] (only for the accelerometer data). For time and magnitude warping, the data are augmented by Bézier curves in the interval \([1-\sigma , 1+\sigma ]\) generated from 10 random points. As one sample is represented by a sequence of characters and cannot be split into sub-sequences, cropping and permutation augmentations are not applicable. Figure 8 zooms into the augmented sensor data of the x-axis signal from Fig. 1.
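The following numpy sketch illustrates three of these augmentations applied per channel with 50% probability; the parameter values follow the text, but the smooth random curve is generated by simple interpolation between random knots instead of the Bézier curves used in the paper.

```python
import numpy as np

rng = np.random.default_rng()

def jitter(x, sigma=0.1):
    """Additive noise scaled by a multiple of the channel's std."""
    return x + rng.normal(0.0, sigma * x.std(), size=x.shape)

def scale(x, sigma=0.1):
    """Multiply the window by one random scalar in [1-sigma, 1+sigma]."""
    return x * rng.uniform(1.0 - sigma, 1.0 + sigma)

def magnitude_warp(x, sigma=0.3, knots=10):
    """Multiply by a smooth random curve varying around [0.7, 1.3];
    linear interpolation between knots stands in for Bezier curves."""
    anchors = rng.uniform(1.0 - sigma, 1.0 + sigma, size=knots)
    curve = np.interp(np.linspace(0, knots - 1, num=len(x)),
                      np.arange(knots), anchors)
    return x * curve

def augment(channel, p=0.5):
    """Apply each augmentation independently with probability p."""
    for fn in (jitter, scale, magnitude_warp):
        if rng.random() < p:
            channel = fn(channel)
    return channel
```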

5 Experiments

This section provides evaluation results for the seq2seq task (Sect. 5.1) and the single character-based classification task (Sect. 5.2), and evaluates the left-handed datasets (Sect. 5.3). We provide an edit distance and writer-dependent analysis in Sect. 5.4.

Table 4 Evaluation results (WER, CER) in % (mean and standard deviation) for our OnHW-equations, OnHW-words500(R), OnHW-wordsRandom and OnHW-wordsTraj writer-dependent (WD) and writer-independent (WI) datasets, and the publicly available IAM-OnDB [51] (line level) and VNOnDB-words [59] datasets

Hardware and training setup For all experiments we use Nvidia Tesla V100-SXM2 GPUs with 32 GB VRAM equipped with Core Xeon CPUs and 192 GB RAM. We use the Adam optimizer with a learning rate of \(10^{-4}\). We run each experiment for 1000 epochs with a batch size of 50 (unless stated otherwise) and report results for the best epoch. We split each dataset into five approximately 80/20 train/validation splits and report the mean and standard deviation of the WER and CER (see the split sketch below). We use our OnHW-equations, OnHW-words500(R), OnHW-wordsRandom and OnHW-wordsTraj as well as the IAM-OnDB [51] and VNOnDB-words [59] datasets for the sequence-based classification task, and the OnHW-symbols, split OnHW-equations and OnHW-chars [65] datasets for the single character-based classification task. Each model is trained from scratch for every dataset. We make use of the time-series classification toolbox tsai [61] that contains a variety of state-of-the-art techniques [6, 15, 21, 24, 25, 34, 39, 72, 87, 89, 99, 103, 110, 113].
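The difference between the WD and WI splits can be sketched with standard scikit-learn utilities; the sample and writer arrays below are placeholders (a sketch, assuming samples and writer IDs are aligned arrays).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, GroupShuffleSplit

samples = np.zeros((1000, 14))             # placeholder MTS features
writer_ids = np.repeat(np.arange(50), 20)  # 50 writers, 20 samples each

# Writer-dependent (WD): samples of every writer may appear in both
# train and validation; a plain random 80/20 split suffices.
wd = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Writer-independent (WI): all samples of one writer fall entirely into
# either train or validation, so the split is grouped by writer ID.
wi = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

for train_idx, val_idx in wi.split(samples, groups=writer_ids):
    pass  # train on writers that are unseen during validation
```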

5.1 Seq2seq task evaluation

Method and architecture evaluation We first evaluate our CNN and attention-based models for the seq2seq classification task. A summary of results is given in Table 4. For all datasets our CNN+BiLSTM model significantly outperforms the CNN+LSTM and CNN+TCN models. The attention-based model performs poorly on large datasets (OnHW-[equations, words500(R), wordsRandom]), but yields better results than the CNN+TCN on our OnHW-wordsTraj camera-based dataset and outperforms the CNN+LSTM and CNN+TCN models on the trajectory-based dataset. The CNN+BiLSTM model achieves a very good CER of 1.78% on the OnHW-equations WD dataset, which increases to 12.98% for the WI task. For the words, IAM-OnDB and VNOnDB datasets, the WI classification task is more difficult. While we achieve very low CERs, the WERs are higher as no lexicon or language model is used. While for the OnHW-wordsRandom dataset the CER of 7.87% for the WD task increases notably to 35.22% for the WI task, the difference for the OnHW-words500 dataset is smaller (17.16% CER for the WD task and 27.80% for the WI task) as words in the validation set do not appear in the training set for the WD task. For the OnHW-words500R dataset, the CER decreases to 5.20% as the split is randomly shuffled. Our OnHW-wordsTraj dataset allows a comparison of three recording devices (i.e., trajectory, IMU and camera). From the CNN+BiLSTM results we see that the spatio-temporal trajectory-based classification task is easier than OnHWR from IMU-enhanced pens. Furthermore, it is challenging to learn the transformation from the camera to the paper plane.

Fig. 9: CER of InceptionTime [25] for different interpolated time-series lengths on the OnHW-equations dataset

Fig. 10: Hyperparameter search of depth and nf for InceptionTime [25] with and without BiLSTM on the OnHW-equations datasets, averaged over 5 splits

Fig. 11: Hyperparameter search of nf for XceptionTime [72] with and without BiLSTM on the OnHW-equations datasets, averaged over 5 splits

Fig. 12: Hyperparameter search of nf for ResNet [103] and ResCNN [113] with and without BiLSTM on the OnHW-equations datasets, averaged over 5 splits

Comparison to state-of-the-art techniques For comparison, we train nine different well-established time-series classification architectures on our OnHW datasets and InceptionTime [25] on the tablet datasets. For these methods we interpolate (using linear splines) and zero pad the time-series to 800 timesteps to obtain a fixed sequence length; 800 timesteps lead to a low CER (see Fig. 9), while above 800 timesteps the training time significantly increases. As these methods are designed for classifying single labels (not sequences of labels), we replace the last linear layer with a max pooling layer (of kernel size 4), a dropout layer (40%) and a 1D convolutional layer (kernel size 1, with as many channels as class labels); a sketch of this head follows below. Similar to our approaches, we further add two BiLSTM layers, each of size 60. InceptionTime is an ensemble of CNNs inspired by Inception-v4. As its default parameters (nf of 32 and depth of 6) lead to inferior performance compared to our methods, we perform a large hyperparameter search for depth (between 3 and 12) and nf (16, 32, 64, 96 and 128) with and without BiLSTMs for the WD and WI tasks (see Fig. 10). On the WD dataset, a higher nf and greater depth lead to a lower CER. For the WI task, larger models tend to overfit to specific writers, and hence, the error rates are constant for nf between 64 and 128, while the CER still decreases for a greater depth. For nf of 32 and depth of 11, InceptionTime+BiLSTM can marginally outperform our CNN+BiLSTM model on the OnHW-equations dataset (1.65% CER WD and 11.28% CER WI) and is notably better on the OnHW-words500 (WD) dataset (12.96% CER) without the two additional BiLSTM layers, but is on par with our CNN+BiLSTM model on the WI task (26.08% CER) and yields marginally higher error rates on the random splits. The results further suggest that the performance strongly depends on the network size. XceptionTime [72] consists of depthwise separable convolutions and adaptive average pooling to capture both temporal and spatial contents. We search for the hyperparameter nf (see Fig. 11) and set \(nf = 144\). The small FCN [103] model yields high error rates, but ResNet [103] (based on FCN) enables the exploitation of class activation maps to find contributing regions in the raw data and improves on FCN. ResCNN [113] integrates residual networks with CNNs. We also set \(nf = 144\) for ResCNN and ResNet (see Fig. 12), which perform similarly, but cannot outperform XceptionTime on our datasets. While additional BiLSTM layers improve the results of InceptionTime, the error rates for XceptionTime, ResNet and ResCNN increase with additional BiLSTM layers. The univariate models LSTM-FCN [39] and GRU-FCN [21] as well as the multivariate models MLSTM-FCN [40] and MGRU-FCN [40], which augment the fully convolutional block with a squeeze-and-excitation block, improve on the FCN results, but are not complex enough to outperform the other architectures on our datasets. In general, word beam search [77] did not improve results and even led to degraded performance. See Appendix 7 for more evaluation details and a comparison to state-of-the-art techniques.
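The head replacement described above can be sketched as follows; `backbone_channels` is a placeholder for the feature width of the respective tsai backbone, and the extra blank channel for CTC decoding is our assumption.

```python
import torch.nn as nn

def seq2seq_head(backbone_channels, num_classes):
    """Replace a single-label classifier head for sequence prediction:
    max pooling (kernel 4), 40% dropout, then a 1x1 convolution whose
    channels equal the number of class labels (+1 assumed here for the
    CTC blank)."""
    return nn.Sequential(
        nn.MaxPool1d(kernel_size=4),
        nn.Dropout(0.4),
        nn.Conv1d(backbone_channels, num_classes + 1, kernel_size=1),
    )
```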

Table 5 Evaluation results (WER, CER) in % (mean and standard deviation over fivefold splits) for different augmentation techniques and sensor choices for the OnHW-equations dataset

Influence of data augmentation We train the CNN+LSTM model on the OnHW-equations dataset with the augmentation techniques described in Sect. 4. Results are given in Table 5. The baseline WER of 22.96% (WD) can be improved with all augmentation techniques, while the WI error of 69.21% is only improved by time warping and jittering. The most notable improvement is given by time warping with 20.90% for the WD task and 64.10% for the WI task. Interpolation to 1000 timesteps did not improve the accuracy, and normalization to \([-1, 1]\) deteriorates training performance. Figure 13 shows augmentation results and combinations thereof for InceptionTime on the OnHW-equations WD dataset. Here, the baseline CER of 1.77% and WER of 12.94% can be notably improved by time warping as a single augmentation (comparable to our CNN+LSTM). The combination of jittering and time and magnitude warping yields the highest error rate reduction.

Fig. 13: Augmentation results for InceptionTime on the OnHW-equations (WD) dataset over five splits

Influence of sensor dropping We train on the OnHW-equations dataset and drop the data of single sensors, e.g., the front or rear accelerometer or the magnetometer, in order to evaluate the influence of each sensor (see Table 5). Only dropping the front accelerometer (WD) and the rear accelerometer (WI) decreases the WER and CER, which could also be attributed to the smaller input dimensionality while leaving the architecture unchanged. Without the magnetometer the WER improves for the WI task, as the magnetic field changes with the recording location but stays constant for the same writer. Dropping the force sensor leads to a significantly higher classification error, as the force sensor provides information that allows a segmentation of strokes.

5.2 Single character task evaluation

Table 6 Recognition rates (CRR) in % for the symbols, split equations and characters WD and WI datasets
Table 7 Recognition rates (CRR) in % for the symbols, split equations and characters WD and WI datasets for the CNN+BiLSTM architecture trained with different loss functions

Method and architecture evaluation We use our OnHW-symbols and split OnHW-equations datasets, the combination of both (samples randomly shuffled) and the OnHW-chars [65] dataset, and interpolate the single characters to the longest single character of the respective dataset (64 timesteps for characters and 79 for numbers/symbols). We train the network proposed in Fig. 6 with one additional dense layer of 100 units. For all methods we use the categorical CE loss for training and the CRR for evaluation. Network parameter choices are described in Appendix 6. The results are summarized in Table 6. We also compare to the state-of-the-art results provided in [65]. While GRU [15] yields very low accuracies for all datasets, standard LSTM units (2 and 3 stacked layers), BiLSTM units and TCNs increase the CRR. Further, FCN [40] and the spatio-temporal variants RNN-FCN [40], LSTM-FCN [39] and GRU-FCN [21] as well as the multivariate variants MRNN-FCN, MLSTM-FCN [40] and MGRU-FCN [40] yield better results. MLSTM-FCN [40] with a standard or attention-based LSTM and with or without a squeeze-and-excitation (SE) block achieves high accuracies, but cannot improve over the state-of-the-art results achieved by [65]. Due to minor and inconsistent changes in performance, it is not possible to make a statement about the importance of the SE block and the attention-based LSTM. The networks based on CNNs, i.e., ResCNN [113], ResNet [103], XResNet [34], InceptionTime [25] and XceptionTime [72], can partly outperform the FCN variants. For XResNet, a smaller network depth is preferable, while for InceptionTime a greater depth and larger nf generally yield better results. We also train TapNet [111], an attentional prototypical network for semi-supervised learning, which achieves the lowest accuracies. We further provide a benchmark for the transformer variants [13, 36, 44, 90, 101] (for details, see Appendix 6). All transformer variants improve over TapNet, but their accuracies are notably lower than those of the convolutional and spatio-temporal methods. TST [110] with Gaussian encoding is on par with the convolutional techniques on the WD datasets. While our CNN+BiLSTM outperforms all methods on all OnHW-chars [65] datasets, its results are not notably different from those of the CNN+LSTM and CNN+TCN architectures, which in turn achieve the best results on the OnHW-symbols and split OnHW-equations datasets as well as on the combined datasets.

Loss functions evaluation We train the CNN+BiLSTM architecture on all single character-based datasets with the CCE loss as baseline and the seven variants described in Sect. 4. For FL, we search for the optimal hyperparameters on the OnHW-chars combined dataset, and for the other methods on the OnHW-symbols dataset (see Appendix 6). We set \(\alpha = 0.75\) and \(\gamma = 8\). From the hyperparameter searches and literature recommendations, we set \(\beta =0.1\) for LSR, \(\beta =0.95\) for SBS, \(\beta =0.8\) for HBS, and \(\alpha =0.95\) for GCE. For the SCE loss, we set \(\alpha =0.5\) and \(\beta =0.5\) for the weighting of the CCE and RCE losses, respectively. Similarly, the regularization terms of the JO loss are weighted by \(\alpha =1.2\) and \(\beta =0.8\). Table 7 gives an overview of the results for all loss functions on all single character-based datasets. The FL improves the CRR on the symbols and equations datasets (WI) in comparison with the baseline, but yields worse results for the other datasets. As the characters in the OnHW-chars dataset are equally distributed, the FL does not have any benefit on training performance there. LSR prevents overconfidence and increases the accuracy for all datasets; it achieves the highest accuracy of all losses for eight of the 12 datasets. As many samples are written similarly, the model tends to be overconfident for such samples, which the confidence penalty of LSR counteracts. Similar to FL, the SBS and HBS losses can only marginally improve results for the symbols and split equations datasets, and even decrease performance for the character datasets; HBS is slightly better than SBS. The GCE loss decreases the classification accuracy for the OnHW-chars datasets, while it achieves the second best CRR of all losses for the split OnHW-equations WD (95.81%) and WI (86.46%) datasets. Yet, the GCE loss often results in NaN losses (see Fig. 25, Appendix 7) and is hence non-robust on our datasets. The improvement of the SCE loss is less significant than that of the other losses and performance even decreases for the OnHW-chars dataset. JO leads to an improvement for all OnHW-chars datasets; it further outperforms all losses for the WI upper task and achieves marginally lower accuracies than the LSR loss for the lower and combined datasets. LSR also achieves the highest accuracies on the OnHW-symbols WD (97.33%) and WI (82.17%) datasets. In summary, all loss variants can improve on the CCE loss for the OnHW-symbols, split OnHW-equations and combined datasets, as these are not equally distributed. LSR, SCE and JO most significantly outperform the other techniques. For accuracy plots and more details, see Appendix 7, Fig. 25.

Table 8 Evaluation results (WER, CER) in % (mean and standard deviation) for our left-handed OnHW-equations-L, OnHW-words500-L and OnHW-wordsRandom-L datasets (left), and recognition results (CRR) in % for our left-handed OnHW-symbols-L, split OnHW-equations-L and OnHW-chars-L datasets (right) for the CNN+BiLSTM architecture
Fig. 14: Evaluation of the ED dependent on the normalized sample lengths for the OnHW-equations dataset

Fig. 15: Evaluation of the ED dependent on the normalized sample lengths for the OnHW-wordsTraj dataset

5.3 Left-handed writers datasets evaluation

For the left-handed writers datasets, we use the pre-trained weights from the right-handed datasets and train the CNN+BiLSTM architecture for 500 epochs. Table 8 summarizes all results for the sequence-based classification task (left) and the single character-based classification task (right). The motion dynamics of right- and left-handed writers are very different, especially regarding pen rotations, and hence the sensor data differ as well [45]. The models can still make use of the pre-trained weights, and fine-tuning leads to 1.24% CER for the OnHW-equations-L dataset for the WD task, and 15.32% CER for the OnHW-words500-L dataset, which is better than for the right-handed task. For the OnHW-wordsRandom-L dataset, the CER (5.40%) increases, while the WER (32.73%) decreases. Consistently, the results for the WI task degrade as the model overfits to specific writers due to the small number of different left-handed writers in the training set. For the single character-based datasets, fine-tuning leads to a high WD classification accuracy of 92% for the OnHW-symbols-L and split OnHW-equations-L datasets (compared to 96.2% and 95.57% for the right-handed datasets, respectively), but the accuracy decreases for the WI tasks to 54% and 51.5% (compared to 79.51% and 83.88% for the right-handed datasets, respectively). Due to the smaller size of the left-handed datasets, the models overfit to specific writers [45].

Fig. 16: Writer-dependent CER (%) for the OnHW-equations (WD) dataset

5.4 Edit distance and writer analysis

Evaluation of sample length-dependent edit distance We show the sample length-dependent counts of wrong predictions, i.e., mismatches, insertions and deletions, for the OnHW-equations (see Fig. 14) and OnHW-wordsTraj (see Fig. 15) datasets. For the OnHW-equations dataset, mismatches and insertions occur frequently at the start and end characters, while deletions are spread more evenly over the whole equations. For the OnHW-wordsTraj dataset, the first character of a word is significantly more often mismatched or has to be inserted or deleted. This reflects the unequal character distribution of the words datasets (see Fig. 4c), while the equations dataset is very equally distributed (see Fig. 4b).

Writer-dependent evaluation Figure 16 shows the writer-dependent evaluation of the OnHW-equations dataset. For many samples of several writers, e.g., IDs 0, 2-4, 24-35, 42-44, and 49-53, the CER is 0% and increases only for a small number of samples. The range of the CER is larger for writer IDs 1, 5-7, 22, 23, and 36-39. Hence, the writing style, and with it the sensor data, of these writers differs and is out-of-distribution with respect to the rest of the dataset.

6 Discussion and summary

6.1 Social impact, applications and limitations

Handwriting is important in many fields, in particular in graphomotor learning. The visual feedback provided by the pen, for instance, helps young students and children to learn a new language. Hence, research on HWR is very advanced. However, state-of-the-art methods to recognize handwriting (a) require writing on a special device, which might adversely affect the writing style, (b) require taking images of the handwritten text, or (c) are based on premature technical systems, i.e., the sensor pen is only a prototype [17]. The publicly available sensor pen developed by STABILO International GmbH has previously been used by [46, 65] and allows easier data collection than previous techniques. Research on collection devices that do not influence the handwriting style is becoming increasingly important, and with it the social impact of the resulting datasets. The aim of our datasets is to support the learning of students in schools or self-paced learning from home without additional effort [4, 106]. A well-known bottleneck of many machine learning algorithms is their requirement for large amounts of data samples without under-represented data patterns. For our HWR application, a large variety of different writing styles (cursive or printed characters, left- or right-handed and beginner or advanced writers), pen rotations and writing surfaces (especially different vibrations of the paper) is necessary. We provide an evaluation benchmark for right- and left-handed datasets. As the motion dynamics between right- and left-handed writers are very different, extracting mutual information is a challenging task [45, 63]. The ratio between both groups approximately fits the real-world distribution, i.e., the under-representation of left-handed writers (10.6%). Only adults participated in the data recording, without any further selection, as the handwriting style of students changes quickly with age [7].

6.2 Experimental results

We performed several benchmarks and come to the following conclusions: (1) For the seq2seq classification task, we evaluated several methods based on CNNs in combination with RNNs on inertial-based datasets written on paper and on tablet, and evaluated state-of-the-art trajectory-based datasets. Depending on the dataset size, our CNN+BiLSTM model is on par with the InceptionTime+BiLSTM architecture. A search over architecture hyperparameters is important to achieve a generalized model for a real-world application. Our transformer-based architecture could not outperform simpler convolutional models. (2) Sensor data augmentation leads to better generalized training. (3) For the single character classification task, our simple CNN+[LSTM, BiLSTM, TCN] models can outperform state-of-the-art techniques. (4) Cross-entropy variants (e.g., label smoothing) improve results depending on the dataset (i.e., its label noise and class balance). (5) Writer-independent classification of (under-represented) left-handed writers is very challenging and is an interesting direction for future research.

6.3 Collection consent and personal information

While recording the datasets, we collected the consent of all participants. We only collected the raw data from the sensor-enhanced pen and, for statistics, the age, gender and handedness of each participant. The handedness is necessary because the pen is rotated differently by left- and right-handed writers. The recordings took place in Germany. An ID is assigned to every participant such that the dataset is fully pseudonymized. The ID is necessary for the WD and WI evaluation.

6.4 Conclusion and future research

We proposed several equations and words OnHWR datasets for the seq2seq classification task, as well as a symbols dataset for the single character classification task, based on a novel sensor-enhanced pen. By utilizing (Bi)LSTM and TCN models combined with CNNs and different transformer models, we provided a broad evaluation benchmark for lexicon-free classification. Various augmentation techniques showed notable improvements in classification accuracy. Our detailed evaluation of the WD and WI tasks sets important challenges for future research and provides a benchmark foundation for novel methodological advancements. For example, semi-supervised learning and few-shot learning such as prototypical networks could improve the classification accuracy for under-represented writers. Exploiting offline datasets for pre-training or the use of lexica and language models might further allow the models to better learn the task.