1 Introduction

In recent years, the development of online educational applications has accelerated greatly, especially after the COVID-19 outbreak. Automatically scoring students' answers in online education applications can ease the burden on teachers of marking a large number of repetitive assignments. However, in Math Word Problems (MWPs), the performance of automatic solving is far from perfect and cannot yet be applied in practice. An MWP is a mathematical problem described in natural language, whose text poses a question that requires a numeric answer. MWP solving is very helpful for cultivating students' mathematical abilities of problem analysis and calculation. Table 1 shows a typical MWP, where the reader is required to answer the number of motorcycles in the parking lot. In the automatic MWP solving task, a machine must deduce an answer to a given mathematical problem by extracting the numeric information implied in the problem text.

Table 1 A math word problem

In the 1960s, some researchers began to adopt Artificial Intelligence (AI) methods to solve MWPs. Recently, Wang et al. used a modified Seq2Seq model to build an MWP solver [1], and more and more researchers have proposed a variety of deep learning approaches for solving MWPs [2,3,4,5,6]. Deep learning-based methods turn MWP solving into a natural language generation task, where the generated text is not "natural" but consists of mathematical symbols. The crucial advantage of deep learning-based models is that they eliminate the need for hand-crafted features.

Nowadays, most SOTA models use the decoder of the Goal-Driven Tree-structured MWP Solver (GTS) proposed by Xie and Sun [4] for decoding and token prediction [5,6,7]. GTS introduced a goal-driven mechanism inspired by human problem solving: when humans read the text of a math word problem, they first figure out which target quantity is to be derived as the goal, and then pay attention to the relevant information in the problem that can help realize the goal [4]. The decoding process of the GTS decoder follows the generation process of a binary tree, and each node corresponds to one goal to be solved.

Figure 1 shows a typical example of a mistake made by GTS when solving the math problem in Table 1. The token "3" predicted by goal vector q7 shows that root goal vector q1 wants to solve the problem by the template "[Total number of motorcycles' wheels] ÷ [Number of wheels each motorcycle has]". Then, goal vector q2 uses another approach to solve the problem, rather than following q1 to calculate [Total number of motorcycles' wheels]. However, goal vector q7 only obtains two kinds of historical information through two recurrent neural networks (RNNs): the goal vector of each ancestor node (e.g. q1) through top-down goal decomposition (blue arrow), and the embeddings of all generated tokens through bottom-up subtree embedding (green arrows). That is to say, q7 cannot obtain the goal vectors of the other generated nodes (i.e. q2, q3, q4, q5, q6) except the ancestor node q1. Meanwhile, the embeddings of the generated tokens cannot reflect changes in the solution strategy. Therefore, goal vector q7 is not notified of the change of solution strategy, and still follows q1 to find [Number of wheels each motorcycle has]. This defect makes it difficult for GTS to handle samples with complex math expressions.

Fig. 1
figure 1

The process of GTS’ node generation

To address this issue, we design a new goal-driven decoding approach called Goal Selection and Feedback (GSF). Figure 2 shows the process of generating the 4th node through GSF. Firstly, the Goal Feedback Operation feeds the information of the 3rd node back to each goal vector (green arrows), allowing them to adjust themselves at each time step and provide more accurate information for the current decoding step. Then, the Goal Selection Operation selects information directly from the updated previous goal vectors and generates the new goal vector q4 through an attention mechanism (blue dotted arrow). Finally, the token "48" is predicted according to q4 and the context from the encoded problem text (gray arrow). This approach allows the decoder to capture the most relevant information from all generated nodes at each time step, because all generated nodes are updated by the goal feedback operation according to the latest result and then provided to the goal selection operation. Moreover, we design the Multilayer Fusion Network (MFN) to enhance the information fusion capacity instead of using a multilayer perceptron, which leads to a better representation for each hidden state. Finally, we use the ELECTRA language model [8] as the encoder. The proposed model is evaluated on the Math23k, Ape-clean, and MAWPS datasets. Experimental results show that our model achieves better performance than strong baselines.

Fig. 2
figure 2

The process of Goal Selection and Feedback

The contributions of this paper are summarized here:

  • We propose the Goal Selection and Feedback (GSF) decoding approach, in which the goal feedback operation feeds the latest result back to each generated goal, and the goal selection operation selects from the updated past goals for decoding.

  • We design the Multilayer Fusion Network (MFN) to model the hidden states instead of using a multilayer perceptron.

  • The experimental results show that our model outperforms several SOTA systems on the Math23k dataset.

2 Related work

In the first stage of MWP solving, from 1960 to 2010, systems such as STUDENT [9], DEDUCOM [10], WORDPRO [11] and ROBUST [12] relied on manually designed rules and schemas for pattern matching. Yun et al. also used schemas for multi-step math problem solving, but the implementation details are not explicitly revealed [13]. These methods are complicated and difficult to reproduce.

After 2010, some researchers started to employ machine learning methods to solve MWPs. Their MWP solvers predicted and filled predefined expression templates through traditional machine learning methods [14, 15]. Another line of work solved MWPs with semantic parsing, which mapped the problem text to structured logic forms and inferred the answer through predefined logic rules [16, 17]. Obviously, all of the above approaches required tremendous human effort in feature engineering and annotation, which results in poor generality.

In recent years, deep learning has made great progress in various domains, such as machine translation [18,19,20], object detection [21,22,23], text classification [24,25,26], and dialogue systems [27,28,29]. It is therefore not surprising that MWPs can be better solved with DL-based methods. The first DL-based MWP solver was an improved Seq2Seq model, an architecture that had been widely applied to translation and question-answering tasks [1]. To improve the generality of DL-based models for solving MWPs, researchers proposed Significant Number Identification (SNI) and Equation Normalization (EN) [1, 2]. Since mathematical expressions can be converted into binary trees, some researchers discarded the linear decoder in Seq2Seq and instead designed brand new tree-structured recursive neural networks for decoding, such as the abstract syntax tree decoder (AST-Dec) [30] and GTS [4].

With the prevailing adoption of GTS, some researchers focused on improving the model's understanding of the problem text while using GTS as the decoder. Zhang et al. proposed a novel graph-based encoder to learn quantity-related features for enhancing problem understanding [5]. Imitating human reading habits, Lin et al. proposed a hierarchical word-clause-problem encoder with a hierarchical attention mechanism to enhance the problem semantics with context from different levels, plus a pointer-generator network that guides the model to copy existing information and infer extra knowledge during decoding [6].

In addition to designing DL-based MWP solvers, some researchers looked for other ways to enhance model accuracy. Shen et al. devised a new ranking task for MWPs and proposed Generate & Rank, a multi-task framework based on a generative pre-trained language model; the model learns from its own mistakes and is able to distinguish between correct and incorrect expressions [31]. Instead of using GTS, Lee et al. proposed the TM-generation model, which uses the Transformer decoder to predict math expression templates and then fills the missing operators in the predicted templates with an operator identification layer that they designed [32].

3 Problem definition and data processing

The formal definition of a Math Word Problem sample is (P,E) and the data processing is shown in Fig. 3, where:

  • The problem text P is a sequence of n word tokens or numeric values: P = {p1, p2,...,pn}, where pi is either a word token or a numeric value. According to SNI [1], the i th numeric value pk appearing in P is denoted as Ni, and we replace pk with the token "NUM" in the original sequence P. Then, the start token "SOS" and the ending token "EOS" are added to the beginning and the end of the sequence respectively. Finally, we get a sequence \(P^{\prime }\) of m = n + 2 tokens (a sketch of this preprocessing follows the definition below).

  • The math expression E derived from problem text P is used to calculate the unknown quantity required by the MWP. Expression E is defined as a sequence containing math operators from Vop, numeric constants from Vcon, and the denoted numeric values NP from the problem text P. Therefore, the output vocabulary of our MWP solver is Vdec = Vop ∪ Vcon ∪ NP. We replace all numeric values in the original expression with the corresponding number token Ni ∈ NP and convert the expression into the corresponding prefix form. In the end, the start token "SOS" and the ending token "EOS" are added to the beginning and the end of the expression respectively.

Given the problem text P, our model computes the output sequence \(\hat {E}\) according to P.
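To make the preprocessing concrete, the following is a minimal sketch of SNI-style number replacement, assuming a whitespace-tokenized problem text and a simple regular expression for numerals (the tokenizer and the numeral pattern are our assumptions, not specified by the paper):

```python
import re

# Crude numeral pattern covering integers, decimals, fractions, and percentages (assumption).
NUM_RE = re.compile(r"^\d+(\.\d+)?(/\d+)?%?$")

def preprocess(problem_tokens):
    """Replace each numeric value with 'NUM' (SNI) and add 'SOS'/'EOS' tokens."""
    n_p, out = [], []
    for tok in problem_tokens:
        if NUM_RE.match(tok):
            n_p.append(tok)      # record N_i in order of appearance
            out.append("NUM")
        else:
            out.append(tok)
    return ["SOS"] + out + ["EOS"], n_p

tokens, numbers = preprocess("the parking lot has 5 vehicles and 14 wheels".split())
# tokens  -> ['SOS', 'the', ..., 'NUM', ..., 'NUM', ..., 'EOS']
# numbers -> ['5', '14']
```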

Fig. 3
figure 3

Data processing

4 Model description

4.1 Overview

To address the shortcomings of GTS, we propose a Goal-Driven MWP Solver with Goal Selection and Feedback. We follow the idea of the goal-driven mechanism, which means that our model would generate a new goal vector at each time step to guide the model to find the solution. Figure 4 shows the overview structure of our model. First, we input the processed text into the ELECTRA language model for encoding. Second, through the main goal initialization, we get the main goal vector q0, which represents the problem in the text that we need to solve. At the same time, we assume that the initial token “SOS” is generated by q0, which is the starting point of the expression generating process.

Fig. 4
figure 4

Overview of our proposed model

As we can see from Fig. 4, the computation of our model proceeds in two directions: the Goal Feedback Operation in the vertical direction (green half arrow) and the Goal Selection Operation in the horizontal direction (blue half arrow). In the vertical direction, each generated goal vector adjusts itself according to the last generated result through the Goal Feedback Operation (green arrows and orange arrows in Fig. 4). After that, each new goal vector is generated by the Goal Selection Operation (black dotted boxes and blue dotted arrows in Fig. 4). The Goal Selection Operation is based on the attention mechanism, which means that every goal vector in the black dotted boxes has the opportunity to affect the generation of the latest goal vector. On the other hand, we treat the latest output token as a partial solution to the problem, and each goal is resolved to some extent when a solution comes out. Therefore, we believe that when the latest result is generated, each goal vector should adjust itself to provide more accurate information for the generation of the next goal vector.

When the latest goal vector is generated, our model summarizes relevant information from the encoded text into a context vector through the attention mechanism (black arrows in Fig. 4). Using the goal vector and the context vector, we score every token y ∈ Vdec, and the token with the highest score is selected as the output of the decoding step (gray arrows and gray box in Fig. 4). The above steps are repeated until the ending token "EOS" is decoded.

4.2 ELECTRA language model

In order to enhance the model's understanding of the problem text, the ELECTRA language model [8] is chosen as the encoder of our model, instead of simply encoding the problem text sequence with an RNN such as an LSTM or GRU. The authors of ELECTRA proposed a more sample-efficient pre-training task called replaced token detection and trained a discriminative model to predict whether each token was replaced or not. Experiments show that the contextual representations learned by ELECTRA substantially outperform those learned by BERT given the same model settings [8].

After tokenization, the input problem text is converted to X = {x1, x2,...,xm}, and each token xk indicates the position of the corresponding word token pk in the input vocabulary. Then, the sequence X is input to the ELECTRA language model.

$$ H = \text{ELECTRA}(X) $$
(1)

where ELECTRA(⋅) denotes the function of ELECTRA language model. The ELECTRA language model accepts a token sequence X as input, and outputs the encoded sequence \(H = \{\mathbf {h}^{\prime }_{1},\mathbf {h}^{\prime }_{2},...,\mathbf {h}^{\prime }_{m}\} \) through a series of calculations, such as multi-head attention, self-attention, and position-wise feed-forward networks. Then, we calculate hi as follows, and treat it as the contextual representation of the token \(p_{i} \in P^{\prime }\).

$$ \mathbf{h}_{i} = \text{LN}(\mathbf{W}_{e} \mathbf{h}^{\prime}_{i}),~~~i=1,2,...,m $$
(2)

where LN(⋅) denotes the layer normalization layer [33], \(\mathbf {W}_{e} \in \mathbb {R}^{d \times d}\) is a trainable matrix and d is the dimensionality of the ELECTRA’s output.
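A minimal PyTorch sketch of (1) and (2), assuming the HuggingFace `transformers` ElectraModel API and a hypothetical checkpoint name; W_e is modeled as a bias-free linear layer:

```python
import torch.nn as nn
from transformers import ElectraModel  # assumes the HuggingFace transformers API

class ProblemEncoder(nn.Module):
    def __init__(self, checkpoint="hfl/chinese-electra-180g-base-discriminator", d=768):
        super().__init__()
        self.electra = ElectraModel.from_pretrained(checkpoint)  # ELECTRA(.) in Eq. (1)
        self.W_e = nn.Linear(d, d, bias=False)                   # W_e in Eq. (2)
        self.ln = nn.LayerNorm(d)                                # LN(.) in Eq. (2)

    def forward(self, input_ids, attention_mask):
        # h'_1..h'_m: one contextual vector per input token
        h_prime = self.electra(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.ln(self.W_e(h_prime))                        # h_1..h_m
```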

4.3 Dynamic token embedding

To make the model's predictions more accurate, we use dynamic token embedding as in GTS [4]. Different from operator tokens and constant tokens, a token y = Ni ∈ NP takes the contextual representation vector hj as its embedding, where j is the index position of Ni in the problem text \(P^{\prime }\). Each token y ∈ Vop ∪ Vcon has the same embedding in all problems.

$$ \mathbf{e}(y) = \left\{\begin{array}{ll} \mathbf{M}_{s}(y) & \text{if } y \in V_{op} \cup V_{con} \\ \mathbf{h}_{loc(y,P^{\prime})} & \text{if } y \in N_{P} \end{array}\right. $$
(3)

where e(⋅) denotes the embedding mapping function, \(\mathbf {M}_{s} \in \mathbb {R}^{d \times \lvert V_{op} \cup V_{con} \rvert }\) is a trainable embedding matrix, and \(loc(y,P^{\prime })\) is the index position of y in \(P^{\prime }\). In this way, every token from the common part Vop ∪ Vcon of all problems' target vocabularies has the same representation in all problems, while the representation of a token y = Ni ∈ NP differs from problem to problem. Such an embedding mapping is intuitive, since the meaning of the same numeric value token Ni necessarily differs across problems.
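A per-token sketch of the mapping in (3); batching the number-position gather is omitted for brevity, and splitting the vocabulary into static and number tokens is left to the caller:

```python
import torch
import torch.nn as nn

class DynamicEmbedding(nn.Module):
    """Eq. (3): static embeddings for V_op and V_con, contextual vectors for N_P."""
    def __init__(self, num_static_tokens, d=768):
        super().__init__()
        self.M_s = nn.Embedding(num_static_tokens, d)  # trainable matrix M_s

    def forward(self, token_id, is_number, num_pos, H):
        # H: (m, d) encoder outputs; num_pos: loc(y, P') for a number token
        if is_number:
            return H[num_pos]                     # h_{loc(y, P')}: differs per problem
        return self.M_s(torch.tensor(token_id))  # shared across all problems
```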

4.4 Multilayer fusion network

To improve the information fusion capacity of our network, we propose the Multilayer Fusion Network (MFN) to generate several hidden states. Given the inputs p and g, the n-layer MFN calculates the result through the following iterative steps:

$$ \begin{array}{@{}rcl@{}} \mathbf{o}_{0} &=& \text{ReLU}(\mathbf{W}_{0}[\mathbf{p},\mathbf{g}]) \\ \mathbf{s}_{k+1} &=& \text{ReLU}(\mathbf{W}_{s_{k+1}}[\mathbf{p},\mathbf{g}]) \\ \mathbf{z}_{k+1} &=& \text{ReLU}(\mathbf{W}_{z_{k+1}}[\mathbf{p}, \text{LN}(\mathbf{o}_{k}),\mathbf{s}_{k+1}]) \\ \mathbf{o}_{k+1} &=& \mathbf{o}_{k} - \mathbf{s}_{k+1} * \mathbf{z}_{k+1},~~~k=0,1,2,...,n-1 \\ \text{MFN}(\mathbf{p},\mathbf{g}) &=& \text{LN}(\mathbf{o}_{n}) \end{array} $$
(4)

where \(\mathbf {W}_{0} \in \mathbb {R}^{d \times 2d},\mathbf {W}_{s_{1}} \in \mathbb {R}^{d \times 2d},...,\mathbf {W}_{s_{n}} \in \mathbb {R}^{d \times 2d},\mathbf {W}_{z_{1}}\in \mathbb {R}^{d \times 3d},...,\mathbf {W}_{z_{n}} \in \mathbb {R}^{d \times 3d}\) are trainable matrices, [⋅,⋅,...,⋅] denotes concatenation, * denotes the Hadamard product, and ReLU(⋅) denotes the ReLU function. o0 is the initial vector combining the information of p and g. sk is the k th information supply vector, and zk is the k th scale gate of sk. The k th scaled supply vector sk ∗ zk decides what information should be added to ok. Through subtraction, ok receives the information provided by sk ∗ zk. The iterative multilayer structure is designed to allow MFN to perform deeper information fusion.
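The iteration in (4) transcribes directly into a small PyTorch module. This is a sketch under the assumption that the LN(.) inside the loop and the LN(.) at the output are separate layer normalization layers; the paper does not say whether they share parameters:

```python
import torch
import torch.nn as nn

class MFN(nn.Module):
    """Multilayer Fusion Network, a direct transcription of Eq. (4)."""
    def __init__(self, d=768, n_layers=2):
        super().__init__()
        self.W0 = nn.Linear(2 * d, d, bias=False)
        self.Ws = nn.ModuleList([nn.Linear(2 * d, d, bias=False) for _ in range(n_layers)])
        self.Wz = nn.ModuleList([nn.Linear(3 * d, d, bias=False) for _ in range(n_layers)])
        self.ln_mid = nn.LayerNorm(d)   # LN(o_k) inside the loop
        self.ln_out = nn.LayerNorm(d)   # LN(o_n) at the output

    def forward(self, p, g):
        pg = torch.cat([p, g], dim=-1)
        o = torch.relu(self.W0(pg))                                        # o_0
        for Ws, Wz in zip(self.Ws, self.Wz):
            s = torch.relu(Ws(pg))                                         # s_{k+1}
            z = torch.relu(Wz(torch.cat([p, self.ln_mid(o), s], dim=-1)))  # z_{k+1}
            o = o - s * z                                                  # o_{k+1}
        return self.ln_out(o)
```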

4.5 Goal selection and feedback

Main goal initialization

To start the decoding process, we use token y0 =“SOS” as the start token of decoding, and the corresponding initial goal vector q0 and the initial prediction vector u0 are calculated as follows:

$$ \begin{array}{@{}rcl@{}} \mathbf{q}_{0} &=& \mathbf{h}_{1} \\ \mathbf{u}_{0} &=& \text{LN}(\text{ReLU}(\mathbf{W}_{m}\mathbf{q}_{0})) \end{array} $$
(5)

where h1 is the contextual representation of the token "SOS" \(\in P^{\prime }\), and \(\mathbf {W}_{m} \in \mathbb {R}^{d \times d}\) is a trainable matrix. The initial goal vector q0 represents the main goal to be solved, which is posed directly by the problem text. With the initial token, the goal vector q0 and the prediction vector u0, the decoder performs each decoding step through the following iterative operations.
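As a sketch, (5) amounts to reading off the encoder state of "SOS" and passing it through one projection (1-indexed h1 becomes index 0 in code):

```python
import torch
import torch.nn as nn

class MainGoalInit(nn.Module):
    """Eq. (5): q_0 = h_1 (the 'SOS' state); u_0 = LN(ReLU(W_m q_0))."""
    def __init__(self, d=768):
        super().__init__()
        self.W_m = nn.Linear(d, d, bias=False)
        self.ln = nn.LayerNorm(d)

    def forward(self, H):            # H: (m, d) encoder outputs h_1..h_m
        q0 = H[0]                    # contextual vector of the leading "SOS" token
        u0 = self.ln(torch.relu(self.W_m(q0)))
        return q0, u0
```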

Goal feedback operation

Given the main goal vector q0, the k th prediction vector uk and the k th output token yk, the decoder merges them through a single-layer perceptron to get the k th solution vector rk. Then, the update gate \({g^{k}_{i}}\) of the i th goal vector is calculated as follows:

$$ \begin{array}{@{}rcl@{}} \mathbf{r}_{k} &=& \text{LN}(\text{ReLU}(\mathbf{W}_{r}[\mathbf{q}_{0}, \mathbf{u}_{k},\mathbf{e}(y_{k})])) \\ {g^{k}_{i}} &=& \sigma (\mathbf{v}_{g}^{\top}\text{ReLU}(\mathbf{W}_{g}[\mathbf{r}_{k},\mathbf{q}_{i}])) \end{array} $$
(6)

where \(\mathbf {v}_{g} \in \mathbb {R}^{d}\), \(\mathbf {W}_{r} \in \mathbb {R}^{d \times 3d}\) and \(\mathbf {W}_{g} \in \mathbb {R}^{d \times 2d}\) are trainable parameters, and σ(⋅) denotes the sigmoid function. Then, each goal vector adjusts itself according to the update gate \({g^{k}_{i}}\) and the feedback vector \(\mathbf {f}^{ k}_{ i}\) calculated by the trainable network MFNf as follows (green arrows in Fig. 5):

$$ \begin{array}{@{}rcl@{}} \mathbf{f}^{ k}_{ i} &=& \text{MFN}_{f}(\mathbf{q}_{i},\mathbf{r}_{k})\\ \mathbf{q}_{i} &:=& (1 - {g^{k}_{i}}) \cdot \mathbf{q}_{i} + {g^{k}_{i}} \cdot \mathbf{f}^{ k}_{ i} ,~~~i=0,1,2,...,k \end{array} $$
(7)
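A sketch of (6)-(7) that updates all goals in one batched pass; it reuses the `MFN` class from the Section 4.4 sketch, and broadcasting r_k over the goal stack is our implementation choice:

```python
import torch
import torch.nn as nn

class GoalFeedback(nn.Module):
    """Eqs. (6)-(7): build the solution vector r_k, then gate-update goals q_0..q_k."""
    def __init__(self, d=768, mfn_layers=2):
        super().__init__()
        self.W_r = nn.Linear(3 * d, d, bias=False)
        self.ln = nn.LayerNorm(d)
        self.W_g = nn.Linear(2 * d, d, bias=False)
        self.v_g = nn.Linear(d, 1, bias=False)
        self.mfn_f = MFN(d, mfn_layers)          # MFN_f, see the Section 4.4 sketch

    def forward(self, q0, u_k, e_yk, Q):
        # Q: (k+1, d) stack of goal vectors q_0..q_k
        r_k = self.ln(torch.relu(self.W_r(torch.cat([q0, u_k, e_yk], dim=-1))))
        r_rep = r_k.expand_as(Q)                 # broadcast r_k against every goal
        g = torch.sigmoid(self.v_g(torch.relu(self.W_g(torch.cat([r_rep, Q], dim=-1)))))
        f = self.mfn_f(r_rep, Q)                 # feedback vectors f_i^k
        return (1 - g) * Q + g * f, r_k          # updated goals and r_k
```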
Fig. 5
figure 5

Process of goal selection and feedback decoding approach

Goal selection operation

After the goal feedback operation, we calculate the selection weight \({a_{i}^{k}}\) from the k th solution vector rk and each goal vector q0, q1,...,qk (blue arrows and blue dotted arrows in Fig. 5):

$$ \begin{array}{@{}rcl@{}} \text{S}_{g}(\mathbf{r}_{k},\mathbf{q}_{i}) &=& \mathbf{v}_{s}^{\top}\text{ReLU}(\mathbf{W}_{s}[\mathbf{r}_{k},\mathbf{q}_{i}]) \\ {a_{i}^{k}} &=& \frac{{\exp}(\text{S}_{g}(\mathbf{r}_{k},\mathbf{q}_{i}))}{{\sum}_{s}\exp(\text{S}_{g}(\mathbf{r}_{k},\mathbf{q}_{s}))} \end{array} $$
(8)

where \(\mathbf {v}_{s} \in \mathbb {R}^{d}\) and \(\mathbf {W}_{s} \in \mathbb {R}^{d \times 2d}\) are trainable parameters. Then the (k + 1)th goal vector qk+ 1 is calculated as follows:

$$ \begin{array}{@{}rcl@{}} \tilde{\mathbf{q}}_{i}^{k} &=& \text{MFN}_{s}(\mathbf{r}_{k},\mathbf{q}_{i}) \\ \mathbf{q}_{k+1} &=& \sum\limits_{i=0}^{k} {a_{i}^{k}}\tilde{\mathbf{q}}_{i}^{k} \end{array} $$
(9)

where MFNs is a trainable network, \(\tilde {\mathbf {q}}_{i}^{k}\) is the sub-goal vector derived by the solution vector rk according to the i th goal vector qi, and weighted summation is performed to obtain the new goal vector qk+ 1.
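Under the same assumptions, (8)-(9) become an attention step over the updated goal stack; again the `MFN` class from the Section 4.4 sketch is reused:

```python
import torch
import torch.nn as nn

class GoalSelection(nn.Module):
    """Eqs. (8)-(9): attend over the updated goals q_0..q_k to produce q_{k+1}."""
    def __init__(self, d=768, mfn_layers=2):
        super().__init__()
        self.W_s = nn.Linear(2 * d, d, bias=False)
        self.v_s = nn.Linear(d, 1, bias=False)
        self.mfn_s = MFN(d, mfn_layers)          # MFN_s, see the Section 4.4 sketch

    def forward(self, r_k, Q):
        # r_k: (d,) solution vector; Q: (k+1, d) updated goal stack
        r_rep = r_k.expand_as(Q)
        scores = self.v_s(torch.relu(self.W_s(torch.cat([r_rep, Q], dim=-1))))  # S_g
        a = torch.softmax(scores, dim=0)         # selection weights a_i^k
        q_tilde = self.mfn_s(r_rep, Q)           # sub-goal vectors
        return (a * q_tilde).sum(dim=0)          # q_{k+1}
```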

Token prediction

Given the goal vector qk+1, in order to accurately predict the output token, we compute the context vector ck+1 as follows to summarize the relevant information of the problem through the attention mechanism (black dotted arrow in Fig. 5):

$$ \begin{array}{@{}rcl@{}} \text{S}_{c}(\mathbf{q}_{k+1},\mathbf{h}_{i}) &=& \mathbf{v}_{c}^{\top}\text{ReLU}(\mathbf{W}_{c}[\mathbf{q}_{k+1},\mathbf{h}_{i}]) \\ w_{i} &=& \frac{{\exp}(\text{S}_{c}(\mathbf{q}_{k+1},\mathbf{h}_{i}))}{{\sum}_{s}\exp(\text{S}_{c}(\mathbf{q}_{k+1},\mathbf{h}_{s}))} \\ \mathbf{c}_{k+1} &=& \sum\limits_{i=1}^{m} w_{i}\mathbf{h}_{i} \end{array} $$
(10)

where \(\mathbf {v}_{c} \in \mathbb {R}^{d}\) and \(\mathbf {W}_{c} \in \mathbb {R}^{d \times 2d} \) are trainable parameters. Next, the prediction vector uk+1 is computed by combining the goal vector qk+1 with the context vector ck+1, and then the probability prob(yi|uk+1) of each token yi ∈ Vdec is calculated as follows (black arrows in Fig. 5):

$$ \mathbf{u}_{k+1} = \text{MFN}_{c}(\mathbf{q}_{k+1},\mathbf{c}_{k+1}) $$
(11)
$$ \begin{array}{@{}rcl@{}} \text{s}(y_{i}, \mathbf{u}_{k+1}) &=& \mathbf{v}_{p}^{\top}\text{ReLU}(\mathbf{W}_{p}[\mathbf{u}_{k+1},\mathbf{e}(y_{i})]) \\ \text{prob}(y_{i} \vert \mathbf{u}_{k+1}) &=& \frac{{\exp}(\text{s}(y_{i}, \mathbf{u}_{k+1}))}{{\sum}_{s}\exp(\text{s}(y_{s}, \mathbf{u}_{k+1}))} \end{array} $$
(12)

where MFNc is a trainable network, e(yi) is the token embedding of yi calculated by (3), \(\mathbf {v}_{p} \in \mathbb {R}^{d}\) is a trainable vector, and \(\mathbf {W}_{p} \in \mathbb {R}^{d \times 2d}\) is a trainable matrix. Finally, token \(\hat {y}_{k+1}\) with the highest probability is selected as the output token of the current decoding step (gray arrows in Fig. 5):

$$ \hat{y}_{k+1} = \underset{y \in V_{dec}}{\arg\max} ~\text{prob}(y | \mathbf{u}_{k+1}) $$
(13)

When the token \(\hat {y}_{k+1}\) is "EOS", the decoder completes the decoding of the problem; otherwise, it continues with the next decoding step.
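A sketch of the whole prediction step (10)-(13), again reusing `MFN` from the Section 4.4 sketch; `E_dec` is the stack of output-vocabulary embeddings e(y_i) produced by (3):

```python
import torch
import torch.nn as nn

class TokenPrediction(nn.Module):
    """Eqs. (10)-(13): attend over encoder states, fuse, and score V_dec."""
    def __init__(self, d=768, mfn_layers=2):
        super().__init__()
        self.W_c = nn.Linear(2 * d, d, bias=False)
        self.v_c = nn.Linear(d, 1, bias=False)
        self.mfn_c = MFN(d, mfn_layers)          # MFN_c, see the Section 4.4 sketch
        self.W_p = nn.Linear(2 * d, d, bias=False)
        self.v_p = nn.Linear(d, 1, bias=False)

    def forward(self, q_next, H, E_dec):
        # H: (m, d) encoder states; E_dec: (|V_dec|, d) token embeddings from Eq. (3)
        w = torch.softmax(self.v_c(torch.relu(
            self.W_c(torch.cat([q_next.expand_as(H), H], dim=-1)))), dim=0)
        c = (w * H).sum(dim=0)                   # context vector c_{k+1}, Eq. (10)
        u = self.mfn_c(q_next, c)                # prediction vector u_{k+1}, Eq. (11)
        s = self.v_p(torch.relu(self.W_p(
            torch.cat([u.expand_as(E_dec), E_dec], dim=-1)))).squeeze(-1)
        probs = torch.softmax(s, dim=0)          # prob(y_i | u_{k+1}), Eq. (12)
        return probs, int(probs.argmax()), u     # greedy token choice, Eq. (13)
```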

4.6 Model training

Formally, for each (Pi, Ei) in the training dataset \(\mathbb {D}=\{(P^{i}, E^{i}) | 1 \le i \le N\}\), Pi is the problem text sequence, and Ei is the math expression sequence corresponding to the problem Pi. The loss function Lossi is defined as the sum of the negative log-likelihoods of the probabilities of predicting the t th token yt ∈ Ei. The total loss function is calculated as follows:

$$ \begin{array}{@{}rcl@{}} Loss_{i} &=& -\sum\limits_{t=1}^{T}\log \text{prob}(y_{t} \vert \mathbf{u}_{t}) \\ Loss_{\text{total}} &=& \sum\limits_{i=1}^{N}Loss_{i} \end{array} $$
(14)

where ut is the prediction vector of the t-th node; T is the number of tokens in Ei, and prob(⋅|⋅) is computed by (12). The training goal of the model is to minimize Losstotal.
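Under teacher forcing, (14) is the usual sequence negative log-likelihood. A minimal sketch (the small epsilon for numerical stability is our addition):

```python
import torch

def sequence_loss(step_probs, target_ids):
    """Loss_i in Eq. (14) for one sample: summed NLL over the T tokens of E^i.

    step_probs: list of (|V_dec|,) probability vectors prob(. | u_t) from Eq. (12).
    target_ids: list of gold token indices y_t in V_dec.
    """
    return -sum(torch.log(p[y] + 1e-12) for p, y in zip(step_probs, target_ids))

# Loss_total is then the sum of sequence_loss over all N training samples.
```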

5 Experiment

In this section, we compare our model with several SOTA baselines, then perform ablation experiments on each module, and finally end up with analyzing the experimental results.

5.1 Dataset and baselines

Dataset

We conduct the experiments on the following datasets:

  • Math23k: The dataset Math23k [1] is a commonly-used large-scale Chinese MWP dataset, containing 22,161 training problems and 1,000 testing problems with solution expressions and answers. Each math word problem can be solved by one linear algebra expression.

  • Ape-clean: The dataset Ape-clean [34] is the cleaned version of the Chinese MWP dataset Ape210k [35]. After cleaning, Ape-clean contains 102,596 training problems and 2,422 testing problems. Each math word problem can be solved by one linear algebra expression.

  • MAWPS: The dataset MAWPS [36] contains English math word problems with one or more unknown variables. We select 2,353 problems with only one unknown variable and perform five-fold cross-validation on them.

Baselines

We compare the following methods on datasets Math23k and Ape-clean:

  • MathEN [2]: The ensemble model selects the result according to the models’ generation probability among BiLSTM, ConvS2S, and Transformer with equation normalization (EN).

  • GroupAtt [37]: The Seq2Seq model with the group attention mechanism to extract global features, quantity-related features, quantity-pair features, and question-related features in MWPs respectively.

  • AST-Dec [30]: This MWP solver uses LSTM for encoding and generates the abstract syntax tree of the equation in a top-down manner when decoding.

  • GTS [4]: The Goal-Driven Tree-structured MWP Solver, using GRU for encoding.

  • Graph2Tree [5]: The MWP Solver with graph-based encoder and GTS decoder.

  • SAUSolver [38]: The semantically-aligned universal tree-structured solver based on an encoder-decoder framework.

  • Generate & Rank [31]: The pre-trained-model-based MWP solver with equation re-ranking mechanism.

  • HMS [6]: The MWP solver with a dependency-based module for encoding and an improved GTS decoder.

  • TM-generation [32]: The MWP solver that uses the Transformer decoder to predict math expression templates and then fills the missing operators in the predicted templates with the operator identification layer its authors designed.

  • ELECTRA-GTS [4]: To compare at the same level of problem-text understanding, we replace GTS's encoder with the ELECTRA language model.

  • ELECTRA-GRU-xL [39]: Similarly, we construct an MWP solver combining the ELECTRA language model, a GRU decoder, and a cross-attention module for comparison. "xL" denotes that the decoder has x GRU layers.

  • ELECTRA-TFM-xL [40]: Similarly, we construct an MWP solver combining the ELECTRA language model and the Transformer decoder. "xL" denotes that the decoder has x Transformer decoder layers.

  • GSFSF-xL: The Goal-Driven MWP Solver with Goal Selection and Feedback proposed in this paper. "xL" denotes that all MFNs in the decoder are set to x layers.

5.2 Implementation details

Our model is implemented with PyTorch on Ubuntu and trained on an RTX 3090. All math expressions of the MWP samples are converted to the corresponding prefix expressions. For the Chinese ELECTRA language model, we use the version pre-trained on a 180G Chinese corpus [41]. The dimensionality of all hidden states of the decoder is set to 768. Our model is trained for 80 epochs with a mini-batch size of 32. For the optimizer, we use AdamW [42] with learning rates of 2 × 10−5 and 1 × 10−3 in the encoder and decoder respectively. The learning rate is halved when the loss reduction is less than 0.1 times the current loss. For initialization, all learnable parameters are sampled from the normal distribution \(N\left (0, \left (\frac {0.5}{\sqrt {d}}\right )^{2}\right )\), where d = 768 is the dimensionality of the hidden states. The results are computed through greedy search.
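The two learning rates and the initialization translate into a short configuration sketch; `model.encoder` and `model.decoder` are hypothetical attribute names, and the loss-triggered halving schedule would be applied manually during training:

```python
import torch
from torch.optim import AdamW  # AdamW optimizer [42]

def build_optimizer(model):
    # Two parameter groups, mirroring the learning rates reported above
    # (model.encoder / model.decoder are assumed attribute names).
    return AdamW([
        {"params": model.encoder.parameters(), "lr": 2e-5},
        {"params": model.decoder.parameters(), "lr": 1e-3},
    ])

def init_decoder(module, d=768):
    # Sample every learnable decoder parameter from N(0, (0.5 / sqrt(d))^2).
    for p in module.parameters():
        torch.nn.init.normal_(p, mean=0.0, std=0.5 / d ** 0.5)
```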

Finally, we use answer accuracy and equation accuracy as metrics to evaluate the models. Under answer accuracy, a prediction is considered correct when the calculated value of the predicted expression equals the answer. Under equation accuracy, a prediction is considered correct when the predicted expression is identical to the labeled expression. Answer accuracy demonstrates the MWP solving ability of the models, while equation accuracy on the training set demonstrates their fitting ability.
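A sketch of the two metrics; evaluating the predicted prefix expression to a numeric value is left out, and the comparison tolerance is our assumption:

```python
def equation_correct(pred_tokens, gold_tokens):
    """Equation accuracy: exact match with the labeled expression."""
    return pred_tokens == gold_tokens

def answer_correct(pred_value, gold_answer, tol=1e-4):
    """Answer accuracy: the evaluated prediction equals the answer (tolerance assumed)."""
    return abs(pred_value - gold_answer) < tol
```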

5.3 Result analysis

Overall result

Table 2 reports the answer accuracy of the baseline models and our proposed model on the Math23k, Ape-clean, and MAWPS datasets. As shown in Table 2, the models containing a pre-trained language model (Generate & Rank, TM-generation, ELECTRA-GTS, ELECTRA-GRU, ELECTRA-TFM, and GSFSF) solve MWPs better, which confirms that the strong encoding ability of a pre-trained language model brings a large improvement in performance. Next, our model outperforms all baseline models on the Math23k and Ape-clean datasets, which demonstrates that our decoder is better than the other decoders at Chinese MWP solving. On the small English dataset MAWPS, the performance of GSFSF matches that of TM-generation and ELECTRA-TFM.

Table 2 Answer accuracy of our model and baseline models

Performance of four decoders

To verify this point, we train four different decoders with the same encoder (the ELECTRA language model). The decoders differ in how they feed the latest result back to the next decoding process:

  • GTS: As the decoder of ELECTRA-GTS, GTS feeds the hidden vectors (goal vectors) of the parent node and sibling subtree to the next decoding process. It cannot obtain the information of all generated nodes.

  • GRU: As the decoder of ELECTRA-GRU, GRU adjusts the hidden vector according to the latest result for the next decoding process. It can keep the historical information in the hidden vector, but it is possible to forget the information of the earlier nodes.

  • Transformer: As the decoder of ELECTRA-TFM, the Transformer treats the embedding of the latest result as the new hidden vector and performs self-attention to capture the related history information for decoding the new token. However, in the self-attention at each time step, the hidden vector of the same node is unchanged, so it requires multi-layer stacking, i.e., applying self-attention and a feed-forward network (FFN) multiple times, to extract information related to the current decoding process.

  • GSF: As the decoder of GSFSF, GSF has the goal selection operation, which is similar to self-attention for decoding. However, the goal selection operation can capture the most relevant information directly from all generated nodes at each time step, because all generated nodes are continuously updated in real time by the goal feedback operation according to the latest result and then provided to the goal selection operation.

First, when GRU is utilized as the decoder, it outperforms GTS on all three datasets. This indicates that a linear-structured decoder is not bad for the MWP task, possibly because mathematical expressions are not long enough to cause gradient diffusion in an RNN. Then, we find that our model and the Transformer decoder perform similarly on the Ape-clean dataset, but there is a gap on the Math23k dataset. This is due to the large number of samples in the Ape-clean dataset, which provides more abundant samples for model training. The difference in performance on the Math23k dataset shows that GSF generalizes better than the Transformer, namely, it requires fewer samples to learn the mathematical relationships in MWPs. On the English dataset MAWPS, the Transformer and GSF achieve the same answer accuracy.

Performance over number of layers

The number of layers is a tunable hyperparameter in ELECTRA-GRU, ELECTRA-TFM, and our model. It should be noted that the layer stacking of our model exists only in the MFN module, not in the entire decoder. We vary the number of layers over 2, 4, and 6 to investigate its effect on the models' performance.

The number of layers of the model can reflect the complexity of the model to a certain extent. When the number of layers increases, the fitting ability of the model tends to be stronger. According to the principle of Occam’s Razor, the optimal complexity of the neural network model is the minimum complexity that can fit the training set. At this time, the model has the best generalization. When the complexity of the model exceeds the optimal complexity, over-fitting often occurs, resulting in the decline of the models’ performance on the test set.

The results of the study are shown in Table 2 and Fig. 6. First, whether on Math23k or Ape-clean, increasing the number of layers in ELECTRA-GRU gradually degrades the model's performance. This may be because 2 layers are already the optimal complexity for ELECTRA-GRU. Second, the answer accuracy of ELECTRA-TFM rises slowly as the number of layers increases, which indicates that the Transformer decoder approaches its optimal complexity only when its complexity is high enough. However, with 2, 4, and 6 layers, the Transformer decoder has 19M, 38M, and 57M parameters respectively, and the accuracy gain brought by this large increase in parameters is weak. Finally, the layer stacking of the GSFSF decoder exists only in MFN, and the model achieves the highest accuracy on Math23k and Ape-clean when the number of layers is 2 and 4, respectively. With 2 and 4 layers, the GSFSF decoder has 27M and 44M parameters respectively. Thus GSFSF not only has fewer parameters than the 6-layer Transformer decoder but also achieves higher accuracy.

Fig. 6
figure 6

Performance over number of layers in dataset Math23k

Ablation study

The Goal Selection and Feedback and the MFN are central to our decoder structure. To investigate their effectiveness, we conduct several ablation experiments on GSFSF-2L. Table 3 shows the results, where "w/o Goal Feedback" denotes that the goal feedback operation is removed; "w/o Goal Selection" denotes that the k th solution vector rk is treated as the (k + 1)th goal vector qk+1 and used directly for the (k + 1)th token prediction step; and "w/o MFN" denotes that MFN is replaced with the Feed-Forward Network (FFN) from the Transformer, i.e. a 2-layer perceptron with a ReLU activation function and a layer normalization layer.

Table 3 Answer accuracy of various decoder configurations

From Table 3 we can see that removing the goal selection operation degenerates our decoder into a linear-structured RNN, which nevertheless still achieves an answer accuracy of 0.855 and exceeds ELECTRA-GTS on Math23k. Second, removing the goal feedback operation means the decoder uses only the goal selection operation for decoding, which achieves better performance than the linear-structured decoder. The original configuration of the decoder has the highest accuracy, which suggests that the goal feedback operation is helpful and complementary to the goal selection operation. Finally, accuracy drops when MFN is replaced with the FFN, which shows that MFN does help to generate better hidden states.

Performance on expression length

In order to compare the decoding ability of each decoder in more detail, we compute their answer accuracy on each expression length interval (prefix form) separately. The accuracy of the models for each expression length interval on the Math23k test set, the Ape-clean test set and the MAWPS dataset is given in Tables 4, 5 and 6 respectively. Five-fold cross-validation is used on the MAWPS dataset, so each sample acts as a test sample exactly once; Table 6 therefore shows the answer accuracy on all samples in the MAWPS dataset when they act as test samples.

Table 4 Answer accuracy over expression length on Math23k test set
Table 5 Answer accuracy over expression length on Ape-clean test set
Table 6 Answer accuracy over expression length on MAWPS dataset

On the test sets, we can see that the answer accuracy decreases as the expression length increases. This is in line with the intuition that a longer math expression usually implies a more complex problem, and the proportion of training samples in these intervals (over 7) is small. Second, GSF and the Transformer outperform GTS across all expression lengths, which indicates that the mathematical relationships in MWPs can be modeled well without a tree-structured neural network. Next, the Transformer performs slightly better than GSF on samples with expression lengths between 3 and 9, while GSF achieves the highest answer accuracy on samples with expression lengths over 9. This indicates that the Transformer is more suitable for samples with medium expression lengths and our decoder has a greater advantage in handling difficult samples. Surprisingly, on the Math23k test set, GSF performs better on samples with expression lengths over 11 than on samples with expression lengths between 7 and 11. This may be due to the consistency of the MWP types between the training and testing samples with expression lengths over 11. In the MAWPS dataset, there are only 185 samples with expression lengths over 5, accounting for only 7.2% of the total. In addition, with five-fold cross-validation only 4/5 of those samples are available for training at a time, resulting in even fewer training samples. All of this makes every model perform poorly on these samples.

In deep learning, a prerequisite for a model to solve a certain kind of sample is that it can fit that kind of sample. Fitting samples means that the model generalizes the training samples into a variety of templates, and then solves problems by recalling those templates. Table 7 reports the fit of the models on the Math23k training set, where equation accuracy on the training set demonstrates fitting ability. In our model, the decoder is designed with a long-range information acquisition mechanism (the goal selection operation) and a timely information updating mechanism (the goal feedback operation). When the mechanism of model computation is sufficiently complex, flexible, and not redundant, the model can solve more problems by summarizing and memorizing more templates during training. From this point of view, an essential reason why GTS, GRU, and the Transformer do not perform as well as GSF on samples with long expressions is that these three models are inferior to GSF in fitting long-expression samples. Namely, GTS and GRU (both with limited historical information acquisition) and the Transformer (with only a long-range information acquisition mechanism) remembered fewer templates than GSF during training.

Table 7 Equation accuracy over expression length on Math23k training set

5.4 Case study and visualization analysis

Further, we conduct case studies on expressions generated by our model and ELECTRA-GTS, and visualize the model decoding in these cases. Three cases are provided in Table 8. Our analyses are summarized as follows:

  • From Case 1, it can be seen that ELECTRA-GTS generates the correct left subtree "××n2n0 − 1n3" at the beginning but generates the wrong right subtree "× n2n0", which demonstrates that the structure of GTS prevents it from obtaining sufficient information about the previously generated nodes. In contrast, our model is able to obtain sufficient information from the previous nodes and generates the correct solution for this MWP.

  • From Case 2, it can be seen that the first goal generated by ELECTRA-GTS is to find the ticket price of a swimming game, while the first goal generated by our model accurately answers the main question posed by the MWP, namely to find the difference between the two ticket prices, and our model subsequently solves this MWP correctly. The difference between these two first goals indicates that the MFN in our model can generate goal vectors with better representations than the two-layer gated feed-forward network of GTS.

  • Case 3 is the sample presented in the Introduction on which GTS made a mistake. We find that our model solves this problem correctly, which shows that Goal Selection and Feedback does ameliorate the shortcoming of GTS.

Table 8 Typical cases translated into English. The expressions in brackets are the corresponding infix expressions of the models' output

Figures 78 and 9 are the visualization heatmaps of our model’s decoding process in Case 1, Case 2 and Case 3 respectively. In the heatmap of Goal Feedback Operation, the shade of the box color indicates the value of the update gate \({g^{k}_{i}}\), that is, how much information of the last generated node yk is fed back to the i th goal. In the heatmap of Goal Selection Operation, the shade of the box color indicates the value of the selection weight \({a^{k}_{i}}\), that is, how much information is selected from the i th goal by the new goal qk+ 1.

Fig. 7
figure 7

Visualization of model decoding in case 1

Fig. 8
figure 8

Visualization of model decoding in case 2

Fig. 9
figure 9

Visualization of model decoding in case 3

Firstly, it can be seen that the first few goal vectors are updated frequently in the Goal Feedback Operation, because the first few goals are the parent goals of the subsequent new sub-goals. When a sub-goal is solved, its parent goal is also partially solved and is updated to inform the subsequent decoding process. An interesting point is that after the last math token is generated (the last row in each heatmap of the Goal Feedback Operation), all parent goals are completely updated (\({g^{k}_{i}} \approx 1\)), and then the ending token "EOS" is generated. Next, in the Goal Selection Operation, the selection range of a new goal often includes its nearby operator, which is its parent node in the binary tree corresponding to the math expression. This shows that the neural model can still learn the tree structure of the math expression without an explicit tree structure.

6 Conclusion

In this paper, we propose a novel decoder that is more suitable for MWP tasks than GTS, especially for long math expressions. Our model uses Goal Selection and Feedback and the Multilayer Fusion Network at each decoding step, providing sufficient history information for decoding and a better representation for each hidden state. Combining the ELECTRA language model with our decoder, the experimental results demonstrate that our model overcomes the shortcomings of GTS and outperforms previous SOTA systems. For future work, we will focus on improving the generalization of the model to make it perform better on complex sample types with few training samples.