1 Introduction

Modern edge communication devices are built around touch-sensitive display panels equipped with handwriting recognition systems. These systems are of great assistance, eschewing the need for structured UIs such as virtual keyboards, which are often slow and error-prone as well as far removed from the natural experience of handwriting with a pen.

In this context, online recognition of glyphs (as opposed to offline recognition, which takes a graphical image representation as input) refers to the problem of mapping spatio-temporal samplings of user gestures corresponding to handwritten text into a symbolic representation. Each 3-dimensional sample identifies a touch. A coherent and consecutive sequence of touches defines a stroke, and strokes can be combined to form glyphs. Glyphs correspond to characters or symbols encoded in a language vocabulary. In this work, we consider the online input of mathematical arithmetic expressions as a formally correct sequence of gestures of numerals, operators and symbols. Note that some numerals and operators may require more than one stroke to be represented, as depicted in Fig. 1. Table 1 formalizes the terminology adopted in this work.

Table 1. Terminology

Gesture recognition applications must solve several problems at once, namely: i) feature extraction in a multi-dimensional spatio-temporal space, ii) segmentation of stroke sequences into glyph items, iii) segmentation of glyph sequences into numerals for numeral recognition, and iv) the encoding of expression rules and patterns to form a correct symbolic output. An example of an online gesture sequence is shown in Fig. 1.

With mathematical expressions, users often wish to go beyond the mere recognition of glyphs and hope for additional tasks to be performed, such as automatic evaluation, step-by-step simplification or the listing of equivalent forms. The Expression Tree (ExpTree) formalism [1] was introduced to represent mathematical expressions as binary trees and consequently resolve all equivalent forms to a unique representation, scheduling their evaluation by transforming an input symbol list into a computation graph. In particular, a post-order traversal of the tree generates the Reverse Polish Notation (RPN), a unique postfix representation that places operators after their operands and removes the need for brackets.
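To make the tree-to-RPN relationship concrete, the following minimal Python sketch (illustrative only, not the formalism of [1]) shows how a post-order traversal of a binary expression tree yields the bracket-free postfix form.

```python
# Illustrative sketch: post-order traversal of a binary expression tree
# produces the RPN (postfix) form, where operators follow their operands.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def to_rpn(node):
    """Post-order traversal: left subtree, right subtree, then the node itself."""
    if node is None:
        return []
    return to_rpn(node.left) + to_rpn(node.right) + [node.value]

# (3 + 4) x 2  ->  3 4 + 2 x
tree = Node('x', Node('+', Node('3'), Node('4')), Node('2'))
print(' '.join(to_rpn(tree)))  # prints: 3 4 + 2 x
```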

Main Contributions

(i) We propose a new dataset for handwritten expressions (cf. Sect. 3), obtained from several hundred users and suitable for a wide range of supervised and unsupervised machine learning applications.

(ii) We study the ability of an attention mechanism to learn and represent implicit structures of spatio-temporal gesture data, even when the underlying syntax is not enforced (in the loss computation or model architecture).

(iii) We demonstrate the power of Transformers not only as language models but also as a solution to several sequence mapping tasks, showing transfer learning behaviours of the encoder on unseen glyphs from online gesture input.

(iv) We propose a small-footprint topology for end-to-end online mathematical expression recognition and ExpTree generation, with fast optimisation, very high accuracy and suitability for edge inference.

(v) We test the model's robustness on ablated input, showing its ability to generate compliant RPN expressions even in the case of missing strokes.

(vi) We show the multi-level segmentation capability of the attention mechanism, highlighting the correlation between syntactically correct predictions and explainability in cross-attention visualisation.

Fig. 1. Online gesture example of a stroke sequence (a) for the mathematical expression (b) and its corresponding RPN expression tree (c). Each cell in (a) depicts the linear interpolation of the spatio-temporal points that form an input token. Green and red cells denote \(\langle \texttt {bos} \rangle \) and \(\langle \texttt {eos} \rangle \) respectively. Glyph segmentation, numeral segmentation and RPN ExpTree parsing are colour coded in blue, green and red respectively. (Color figure online)

2 Related Work

The field of Handwriting Text Recognition (HTR) consists of a set of techniques and algorithms that aim at generating text directly from handwritten inputs. Most HTR systems [2] work on offline data due to dataset availability [3]. With the current popularity of the attention mechanism [4, 5], the field remains in constant development. However, as noted in [6], the temporal dimension provides valuable additional information that may simplify stroke segmentation and avoid recourse to complicated regression strategies such as text-line segmentation [7]. As a result, online methods may expect superior performance over their offline counterparts, as reported in an extensive 2014 survey of online HTR methods [6]. Further progress has since been observed, with much effort and resources devoted to improving existing techniques [8, 9].

In this context, Handwriting Digit Recognition (HDR) remains a popular HTR sub-problem, still actively researched using both offline [10] and online [11] methods. In particular, Handwritten Mathematical Expression Recognition (HMER) consists of the generation of mathematical expressions using formal syntaxes such as LaTeX. State-of-the-art HMER models have reached impressive levels of accuracy, particularly when exploiting attention [12] and combining the potentialities of online and offline data [13]. However, although predictions are mostly correct, these models fail to learn the intrinsic structure of the mathematical expression. In contrast, learning a tree representation provides a more natural form [14] and can be achieved with an RNN encoder and an HMER tree decoder that explicitly represents the tree formalism.

We propose to push this challenge further, leaving the task of implicitly learning the RPN syntax to the model and relying on the attention mechanism embedded in a Transformer framework [15]. This provides a powerful sequence mapping architecture entirely based on the attention mechanism [4], eschewing recourse to recurrent or convolutional layers, hence allowing for significant parallelisation and unattenuated gradient flow. This topology currently stands as the state of the art on almost all NLP tasks [16,17,18], but also on a wider and more generic group of sequence transduction problems [19,20,21,22,23]. The Transformer's popularity has seen many experiments revisiting its design, with several optimized architectures being proposed [24,25,26,27]; however, very few are capable of clearly outperforming the original topology. As a result, this work follows the seminal Transformer proposal of [15].

3 Dataset

An important contribution of this work is an online gesture dataset of mathematical expressions suitable for investigating several tasks, such as Handwriting Character Recognition (HCR), HDR or HMER, but also touch, stroke or glyph segmentation, automatic result computation, unsupervised generation or, eventually, ExpTree building. Our handwritten database is presented as a coherent collection of tables composing an SQL schema with spatio-temporal data for Arabic numerals [11] and mathematical symbols, collected from volunteers writing on touch panels. This stage saw the contribution of 455 subjects for a total of 21 752 labelled glyphs composed of 27 477 strokes, amounting to over 700 thousand touches. The dataset can be used at different levels of granularity, namely touch, stroke and glyph.

Subjects have been split into training, validation and test sets (60/20/20 proportions) such that models were tested on unseen handwriting styles to ensure accurate estimation of the generalisation power. In addition, strokes were also randomly augmented and composed to form expressions.
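As an illustration of this subject-level split, the sketch below partitions contributor identifiers (rather than individual samples) into 60/20/20 subsets so that test handwriting styles remain unseen; the seed and ordering are assumptions for illustration only.

```python
# Minimal sketch of a subject-level 60/20/20 split; splitting by subject id
# (not by sample) keeps unseen handwriting styles in the validation/test sets.
import random

def split_subjects(subjects, seed=0):
    rng = random.Random(seed)
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_ids, val_ids, test_ids = split_subjects(list(range(455)))
```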

An expression (E) is defined as a bounded sequence of numerals (N) and operators (\(\odot \in \{+,-,\times ,\div \}\)). The generation of expressions is carried out according to the following grammar:

1. an expression can be a numeral: \(E\rightarrow N\)

2. an expression can be a binary operation: \(E\rightarrow E \odot E\)

3. an expression can be a binary operation between brackets: \(E\rightarrow (E \odot E)\)

As a supplementary rule, every expression must end with the ‘=’ symbol (a sketch of such a generator is given below). For each expression, we provide three ground-truth labels (namely ASCII text, RPN tree and numerical evaluation), for a total of 240 000 samples split as specified above. In this work, we report results at the stroke level, leaving the burden of glyph segmentation to the model.
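The following Python sketch illustrates a generator following this grammar; the depth limit, numeral range, branching probabilities and ASCII operator symbols are assumptions, not the exact recipe used to build the dataset.

```python
# Hedged sketch of the expression grammar: E -> N, E -> E (op) E, E -> (E (op) E),
# with every expression terminated by '='. Probabilities and ranges are illustrative.
import random

OPS = ['+', '-', 'x', '/']   # ASCII stand-ins for {+, -, times, divide}

def gen_numeral(rng):
    return str(rng.randint(0, 999))

def gen_expr(rng, depth=0, max_depth=3):
    rule = rng.random()
    if depth >= max_depth or rule < 0.4:              # E -> N
        return gen_numeral(rng)
    left = gen_expr(rng, depth + 1, max_depth)
    right = gen_expr(rng, depth + 1, max_depth)
    op = rng.choice(OPS)
    if rule < 0.7:                                    # E -> E (op) E
        return f'{left}{op}{right}'
    return f'({left}{op}{right})'                     # E -> (E (op) E)

rng = random.Random(0)
print(gen_expr(rng) + '=')   # every expression ends with '='
```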

4 Transformer Architecture and Experimental Details

Our model leverages the original Transformer architecture [15]; however, crucial modifications are introduced to work with spatio-temporal data. Given some input sequence, \(X\in \mathbb {R}^{d_\textrm{f}\times n}\), of n stroke tokens defined as interleaved spatio-temporal data with zero-padding to a fixed length \(d_\textrm{f}\) (a maximum of 64 (x, y) touch samples per stroke, appropriately \(\langle \texttt {bos} \rangle \) prefixed, \(\langle \texttt {eos} \rangle \) suffixed and \(\langle \texttt {pad} \rangle \) padded), a mask \(M_x\) is computed to ensure the encoder’s attention is only paid to valid online data tokens.
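As an illustration of this input construction, the sketch below packs one stroke's time-ordered (x, y) touches into a fixed-length token of \(d_\textrm{f}=128\) interleaved scalars; the helper names are hypothetical and the \(\langle \texttt {bos} \rangle \)/\(\langle \texttt {eos} \rangle \)/\(\langle \texttt {pad} \rangle \) special tokens are omitted for brevity.

```python
# Hedged sketch: one stroke (a list of (x, y) touches) becomes a fixed-length,
# zero-padded token of 128 scalars (64 samples, interleaved x, y).
import numpy as np

D_F = 128           # token length d_f
MAX_SAMPLES = 64    # maximum (x, y) touch samples per stroke

def stroke_to_token(stroke_xy):
    """stroke_xy: array of shape (k, 2) with k <= 64 time-ordered touches."""
    flat = np.asarray(stroke_xy, dtype=np.float32)[:MAX_SAMPLES].reshape(-1)
    token = np.zeros(D_F, dtype=np.float32)
    token[:flat.size] = flat          # interleaved x0, y0, x1, y1, ...
    return token

def expression_to_input(strokes):
    """Stack stroke tokens into an (n, d_f) array (the paper's X is its transpose)
    together with a mask marking valid tokens."""
    X = np.stack([stroke_to_token(s) for s in strokes])
    mask = np.ones(len(strokes), dtype=bool)
    return X, mask
```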

As the input is composed of spatio-temporal information corresponding to touches, each encoder token represents a stroke as \(d_\textrm{f}\) real-valued scalars (cf. Fig. 1). Input tokens are therefore drawn from a potentially unbounded vocabulary, which eschews any form of embedding lookup.

Positional encoding provides a strategy to embed the positional information of input tokens in the encoder, a necessary operation since the attention mechanism has no built-in concept of sequentiality. Frequency modulation is proposed in [15]. However, since we observed no performance gain with such a strategy, we use a learnable 1D embedding based on the incremental index of the token. Stroke positions are encoded in \(P_x\in \mathbb {R}^{d_\textrm{f}\times n}\).
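A minimal PyTorch sketch of such a learnable index-based positional embedding is given below; module and parameter names are illustrative.

```python
# Minimal sketch (assumed shapes): a learnable 1D positional embedding indexed
# by token position, used instead of sinusoidal (frequency-modulated) encoding.
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_f=128):
        super().__init__()
        self.embed = nn.Embedding(max_len, d_f)

    def forward(self, x):                      # x: (batch, n, d_f)
        positions = torch.arange(x.size(1), device=x.device)
        return self.embed(positions)           # (n, d_f), broadcastable onto x

# usage: encoder input is x + alpha * pos(x), with alpha a blending scale
```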

The encoder is trained to learn some latent sequence representation \(Z={{\,\textrm{Enc}\,}}(X +\alpha P_x, M_x) \in \mathbb {R}^{d_\textrm{a}\times d_\textrm{h}\times n}\) where \(\alpha \) is a scaling factor blending the input data and positional information, \(d_\textrm{a}\) the number of attention heads and \(d_\textrm{h}\) the hidden state dimension of the attention heads. The encoder consists of a stack of \(l_e\) identical multi-head vanilla self-attention layers and a positional feed-forward network of dimension \(d_\textrm{p}\). Each layer is followed by a residual connection before layer-normalisation.

In this work, we explored the transfer learning capabilities of the encoder, which was never trained from scratch but relied on an optimised snapshot pre-trained in conjunction with a language modelling decoder on a large corpus of English sentences [28] containing almost no digits or arithmetic operators (classified as \(\langle \texttt {unk} \rangle \) tokens). This transfer learning strategy resulted in considerable speed-up during training and model optimisation. We use the frozen encoder, with \(\mathrm {\Theta }_\textrm{e}\) parameters, as a feature extractor on this new domain.

The decoder generates a causal sequence of tokens in an auto-regressive manner, given some output vocabulary and its token encoding. It is initialised with the \(\langle \texttt {bos} \rangle \) token and iteratively outputs a new token using greedy sampling of the decoder’s softmax output until the \(\langle \texttt {eos} \rangle \) token is predicted or the maximum sequence length, m, is reached. The decoder also consists of \(l_d\) identical layers, each composed of: i) a masked self-attention layer that prevents the decoder from peeking at subsequent tokens, ii) a cross-attention layer that attends over the encoder output Z to generate predictions, and iii) a feed-forward layer as in the encoder but of dimension \(3d_\textrm{p}\).
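The greedy auto-regressive decoding loop can be sketched as follows; the decoder call signature and special token ids are assumptions for illustration.

```python
# Hedged sketch of greedy auto-regressive decoding: start from <bos>, repeatedly
# feed the growing sequence to the decoder and take the arg-max token until
# <eos> or the maximum output length m is reached.
import torch

@torch.no_grad()
def greedy_decode(decoder, memory, bos_id, eos_id, max_len):
    tokens = [bos_id]
    for _ in range(max_len - 1):
        inp = torch.tensor(tokens).unsqueeze(0)     # (1, t) token ids so far
        logits = decoder(inp, memory)               # assumed shape (1, t, vocab)
        next_id = int(logits[0, -1].argmax())       # greedy sampling
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```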

At each step, the decoder’s input is an auto-regressive sequence of tokens mapped into an embedding layer with positional encoding, and used to predict the next token of the output sequence. All \(\mathrm {\Theta }_\textrm{d}\) parameters of the decoder were trained from some randomly initialised state.

Experimental details: models were all configured with \(d_\textrm{f}= d_\textrm{a}\times d_\textrm{h}= d_\textrm{p}= 128\). For v\(\,_{\!1-5}\), \(n = 2\,m = 24\); for v\(\,_{\!10-11}\), \(n = 2\,m = 48\). The encoder has \(\mathrm {\Theta }_\textrm{e}=523\,520\) parameters and the decoder has \(\mathrm {\Theta }_\textrm{d}=934\,136\) parameters. Models were trained on Nvidia TitanX GPUs for a maximum of 200 epochs, using the cross-entropy loss and the Adam optimiser with a decay schedule (initial learning rate of \(8\times 10^{-4}\), halved every 30 epochs).
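Assuming a standard PyTorch training loop, the reported optimisation setup roughly corresponds to the following sketch (the pad-token id is an assumption).

```python
# Sketch of the reported optimisation setup: Adam with initial learning rate
# 8e-4, halved every 30 epochs, and a cross-entropy loss over output tokens.
import torch
import torch.nn as nn

PAD_ID = 0  # assumed id of the <pad> token

def make_training_objects(model):
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
    optimizer = torch.optim.Adam(model.parameters(), lr=8e-4)
    # halve the learning rate every 30 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)
    return criterion, optimizer, scheduler
```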

Table 2. Expression recognition, model hyper-parameters and dataset configuration. Performance is reported in terms of Cross-Entropy Loss (XEL) and normalised Levenshtein distance Accuracy (LA). Model v\(\,_{\!4}\), trained on larger expressions and using 4 Heads (H) in a 4-Layer (L) decoder, performs best.

5 Experimental Results

A series of experiments were carried out to investigate two different problems, namely: (1) expression recognition in glyph sequences and (2) ExpTree recognition in RPN forms. The first task involves the recognition of a sequence of glyphs composing an arithmetic expression from stroke input as time series. The second task requires further understanding of symbolic syntax and semantics through the construction of an ExpTree using postfix notation.

Models are evaluated using a number of performance metrics on the test sets, and results are reported in terms of: (a) Cross-Entropy Loss (XEL), (b) normalised Levenshtein distance [29] Accuracy (LA), (c) Character Error Rate (CER), and, where applicable, (d) RPN Accuracy Range (cf. Sect. 5.2). LA and CER are both metrics based on the edit distance.
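For reference, a common way to compute these edit-distance metrics is sketched below; the exact normalisation used in the paper may differ slightly.

```python
# Hedged sketch of edit-distance metrics: LA = 1 - LD / max(len), CER = LD / len(ref).
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_accuracy(pred, ref):
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref), 1)

def character_error_rate(pred, ref):
    return levenshtein(pred, ref) / max(len(ref), 1)
```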

5.1 Expression Recognition

In this set of experiments, models v\(\,_{\!1-3}\) are trained to output glyph sequences of simple arithmetic expressions in the absence of brackets, while model v\(\,_{\!4}\) adds groups of terms with brackets. Table 2 summarises the training datasets, model hyper-parameter configurations and performance evaluation for this experiment.

Table 3. ExpTree recognition for various model hyper-parameters. Performance is reported in terms of Cross-Entropy Loss (XEL), normalised Levenshtein distance Accuracy (LA), Character Error Rate (CER), and RPN Accuracy Range (RAR). Models were trained on 240k-expression datasets. Fine-tuned model v\(\,_{\!11}\), with \(\langle \texttt {eon} \rangle \) for numeral segmentation, provides the best performance.

We observe that there are no clear benefits in increasing the number of decoder heads in the absence of brackets (models v\(\,_{\!2-3}\)). However, despite an increase in vocabulary size and, in principle, some decoding complexity, the addition of brackets resulted in better performance, as seen in model v\(\,_{\!4}\). This model is capable of learning some non-trivial, valuable syntax rules, such as that the number of ‘(’ should match that of ‘)’, or that an operator can never precede a ‘)’.

5.2 Expression Tree Recognition

The ExpTree recognition task requires an additional step beyond glyph recognition: the construction of an RPN form. In this set of experiments, model performance is also evaluated in terms of CER and RPN Accuracy Range (RAR), defined as the range \([1-V_\ell ^\textsf {max}, 1-V_\ell ^\textsf {min}]\), where \(V_\ell \) stands for violation loss. If \(v_i\) denotes the count of violations in the i-th expression, \(V_\ell ^\textsf {min} = \frac{1}{N}\sum _{i=1}^N \mathbbm {1}_{v_i>0}\) and \(V_\ell ^\textsf {max} = \frac{1}{N}\sum _{i=1}^N v_i\), where N is the test set cardinality. Referring to the standard infix-to-postfix conversion algorithm in [1], a violation occurs every time the stack is in an inconsistent state while the conversion is performed.

This does not require a stack to be explicitly simulated. Instead, one can linearly scan the output using a counter, incrementing its value for a push and decrementing it for a pop. The counter should equal 1 at the end and never become negative. The number of violations is then the number of times a negative value is observed plus the absolute difference between the final counter value and 1.
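A sketch of this counting procedure is given below, treating an operand as one push and a binary operator as two pops followed by one push; the operator set is an assumption, and terminating ‘=’ or \(\langle \texttt {eon} \rangle \) tokens are assumed to be filtered out beforehand.

```python
# Sketch of the violation count described above: scan the RPN output with a
# counter; operands push one value, binary operators pop two and push one.
OPERATORS = {'+', '-', 'x', '/'}   # assumed ASCII operator symbols

def rpn_violations(tokens):
    counter, violations = 0, 0
    for tok in tokens:
        if tok in OPERATORS:
            for _ in range(2):       # pop the two operands
                counter -= 1
                if counter < 0:      # popping from an "empty stack"
                    violations += 1
            counter += 1             # push the result
        else:
            counter += 1             # push an operand
    # a well-formed RPN expression leaves exactly one value on the stack
    return violations + abs(counter - 1)

# e.g. rpn_violations(['3', '4', '+', '2', 'x']) == 0
```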

Table 3 summarises experimental results on ExpTree predictions. Models v\(\,_{\!5,\,10-11}\) were trained on the same dataset size as v\(\,_{\!2-4}\) (240k expressions), with the possible inclusion of brackets. The v\(\,_{\!5}\) model training dataset further constrained numerals to contain at most one decimal digit. This restriction was lifted in training sets associated with models v\(\,_{\!10-11}\). As a result, an end-of-numeral token, \(\langle \texttt {eon} \rangle \), was added to the decoder’s output vocabulary for learning an additional numeral segmentation task of RPN forms.

Fig. 2. Cross-attention plots. In (a), output tokens ‘.’ (decimal mark) and \(\langle \texttt {eon} \rangle \) (end-of-numeral) can be seen tracking the previous digit; in (b), output token ‘=’ is attending to the \(\langle \texttt {eos} \rangle \) token.

With the same hyper-parameter configuration as in Table 2, an expected degradation in performance is observed for model v\(\,_{\!5}\) on this more complicated task. The addition of the \(\langle \texttt {eon} \rangle \) token in model v\(\,_{\!10}\) showed a significant improvement in accuracy, outperforming our best results for simple expression glyph recognition. Despite the use of a larger vocabulary for the decoder’s output, the addition of a specific token to explicitly model the language semantics of numerals is once again observed to yield higher accuracy. The new token forces the network to learn a pattern, resulting in better numeral segmentation and improved performance.

In Sect. 4 we proposed to test the transfer learning capabilities of the encoder, using frozen parameters on a new domain. Excellent results have been observed, demonstrating the encoder’s ability to correctly segment and combine strokes, generating latent representations that are generic enough to be valuable for downstream tasks even when used with a completely different output vocabulary.

However, further improvement can be reached by fine-tuning all parameters, as observed with model v\(\,_{\!11}\), which started from the frozen encoder weights of model v\(\,_{\!10}\) and introduced the concepts of digits and operators to the encoder for the first time. The final model achieves 94% Normalised Levenshtein Accuracy with a Character Error Rate lower than 7%, generating on average 94% of strings compliant with RPN, while the mean number of violations per output expression is only 0.067.

6 Attention Visualisation and Output Distributions

Visualisation of attention mechanisms provides some interesting insights into the learning process. Figure 2a depicts the cross-attention weights that the decoder puts over the encoder’s output. It shows that head 1 of layer 1 is responsible for numeral segmentation: for every \(\langle \texttt {eon} \rangle \) or decimal mark token, this head has learned to attend over the stroke of the previous digit. In Fig. 2b, head 4 of layer 3 attends over the \(\langle \texttt {eos} \rangle \) token while predicting the ‘=’ token, demonstrating that the model has successfully learned the syntax rule ‘every expression must end with the ‘=’ symbol’.
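A plot such as Fig. 2 can be produced with a few lines of matplotlib, assuming the cross-attention tensor of the inspected layer is available; the tensor shape and argument names below are assumptions.

```python
# Hedged sketch: visualise one cross-attention head as a heat map of decoder
# output tokens (rows) attending over encoder stroke tokens (columns).
# `attn` is assumed to be a numpy array of shape (heads, out_len, in_len).
import matplotlib.pyplot as plt

def plot_cross_attention(attn, head, out_labels, in_labels):
    weights = attn[head]                      # (out_len, in_len)
    fig, ax = plt.subplots()
    ax.imshow(weights, aspect='auto')
    ax.set_xticks(range(len(in_labels)), labels=in_labels, rotation=90)
    ax.set_yticks(range(len(out_labels)), labels=out_labels)
    ax.set_xlabel('encoder stroke tokens')
    ax.set_ylabel('decoder output tokens')
    return fig
```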

Figure 3 shows the confusion matrix over the decoder’s vocabulary for the average probability distribution of the output softmax. This provides some insight into the model mispredictions leading to errors. In Fig. 3a, model v\(\,_{\!10}\) leveraged a frozen encoder pre-trained on a completely different output vocabulary with no digits and operators. The model confuses ‘2’ with ‘3’ and, to a lesser extent, operator ‘-’ with ‘+’, since the latter is often written with a horizontal stroke. Figure 3b shows that fine-tuning the encoder in model v\(\,_{\!11}\) results in better performance and improved diagonality, which also justifies the greedy decoding strategy used in our decoder.
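Such an averaged-softmax matrix can be accumulated as sketched below, assuming access to the per-prediction softmax outputs and ground-truth token ids.

```python
# Sketch of the averaged-softmax "confusion" matrix: for each target token id,
# accumulate the mean predicted probability distribution over the vocabulary.
import numpy as np

def mean_softmax_matrix(probs, targets, vocab_size):
    """probs: (num_predictions, vocab_size) softmax outputs;
       targets: (num_predictions,) ground-truth token ids."""
    matrix = np.zeros((vocab_size, vocab_size))
    counts = np.zeros(vocab_size)
    for p, t in zip(probs, targets):
        matrix[t] += p
        counts[t] += 1
    return matrix / np.maximum(counts[:, None], 1)
```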

Fig. 3. Softmax distribution mean of the decoder’s output predictions showing the probability mass for all token pairs. Frozen model v\(\,_{\!10}\) in plot (a) reveals decoding errors caused by confusion between digits 2 and 3, and also between operators. Model v\(\,_{\!11}\) in plot (b) shows that fine-tuning on all glyphs reduces confusion dramatically.

7 Model Robustness

Model robustness is investigated by means of ablation studies; strokes are removed from the input sequence to observe the model’s ability to enforce domain rules even when it is fed with incorrect expressions.

Table 4. Model robustness: ablation experiments in which strokes are elided from the input, corresponding to the equal sign in rows 1–3, a closing bracket in row 4 and an operator in rows 5–6. The metric is the Levenshtein Distance (LD).

Equal sign ablation: in our dataset, every syntactically correct expression must end with ‘=’. The learning of this rule is assessed by observing the inference results of models v\(\,_{\!4,\,10-11}\) when the strokes representing the equal sign are omitted from the encoder’s input. All three models are able to make the correct inference, inserting the missing ‘=’ in the decoder’s output as shown in rows 1–3 of Table 4.

Closing bracket ablation: in any correct plain expression, the number of ‘(’ should match that of ‘)’. This syntactic rule is investigated with model v\(\,_{\!4}\), which was trained to recognise the glyphs of an expression (this is not possible with models v\(\,_{\!10-11}\) as RPN forms eschew the use of brackets). When the stroke of a closing bracket was removed from the encoder’s input, the model acknowledged the input error and inserted the missing bracket in the output, as shown in row 4 of Table 4. Of course, the exact position is not always guessed correctly, but the symbol is predicted so as to ensure the syntactic correctness of the output.

Operator ablation is investigated on models v\(\,_{\!10-11}\), where an operator’s strokes are removed from the input, as shown in rows 5–6 of Table 4. To ensure ExpTree correctness when using postfix notation, an output expression must terminate with an operator and its total number of operators must always be one fewer than the number of operands. Both models appear to have learned this rule and are able to infer the presence of an additional operator at the end (the actual operator can only be guessed).

8 Conclusion

This work proposed a Transformer network for mathematical expression tree building from online input gesture data corresponding to handwritten strokes of digits and mathematical symbols. The encoder’s input was modified to receive spatio-temporal data as real-valued tokens, so that it can operate directly at the stroke level without the need for mapping onto a fixed input vocabulary. The model can predict ExpTrees by handling internally the multi-level segmentation of the input (at glyph and numeral levels) and by learning how to represent and enforce the syntactic and semantic rules of the data. In addition, index positional encoding was shown to be as effective as cosine modulation while standing as a simpler and more natural encoding of positional information.

The Transformer’s ability to generate complex representations and learn non-trivial input/output mappings between sequences is well known [16, 19]. However, the challenge was pushed further in this work, with no ad hoc solutions to represent syntax or semantic rules and no engineered loss computation or model architecture. In addition, the encoder was trained on a completely different domain [28] and used as a frozen feature extractor in most experiments. Such transfer learning capabilities suggest that the encoder can create general latent representations suitable for problems of a different nature, reducing the overall number of model parameters.

The objective of this work is not so much to push out some state-of-the-art model but rather to state some important considerations that may be starting points for future work in language modelling. Neural Machine Translation may be extended in this way to online data at different granularity levels, with no need for separate input segmentation or complex positional embeddings. Finally, pre-trained encoders could be effectively leveraged with transfer learning on different domains without fine-tuning or explicit domain adaptation, accelerating training for new problem classes where computational power, time or dataset size is limited.