1 Introduction

Foundation models are large-scale language models that contain enormous numbers of parameters and are pretrained on massive amounts of text data, often hundreds of millions or even billions of words. The pretraining and large parameter counts allow them to generate high-quality, human-like responses across a wide range of tasks and applications that NLP researchers previously thought required language understanding, such as question answering, dialogue generation, and mathematical reasoning (Bommasani et al., 2022). Recently released large language models, such as OpenAI’s GPT-3 (Brown et al., 2020), Google’s FLAN-T5 (Wei et al., 2021), and Facebook’s LLaMA (Touvron et al., 2023), are some of the most well-known foundation models and currently dominate the field of natural language processing. They achieve human-level performance on various language tasks and can follow human-defined instructions. However, it is still unclear whether these large foundation models can perform complex logical reasoning at a level comparable to humans. Our motivation is therefore to uncover the abilities and limitations of neural foundation models in monotonicity reasoning and to investigate how logical reasoning should be approached in the age of neural foundation models. We focus our discussion mainly on the monotonicity-based Natural Language Inference task.

Natural Language Inference (NLI), also known as recognizing textual entailment (RTE), is one of the important benchmark tasks for natural language understanding. Many other language tasks can benefit from NLI, such as question answering, text summarization, and machine reading comprehension. The goal of NLI is to determine whether a given premise P semantically entails a given hypothesis H (Dagan et al., 2013). Consider the following example:

  • P: An Irishman won the Nobel Prize for literature.

  • H: An Irishman won the Nobel Prize.

The hypothesis can be inferred from the premise, and therefore the premise entails the hypothesis. To arrive at a correct determination, an NLI model often needs to combine different types of inference, including various forms of lexical and logical inference. In this paper, we are concerned with monotonicity reasoning, a type of logical inference based on word or phrase replacement (Hu et al., 2019). Below is an example of monotonicity reasoning:

  1. (a) All students \(\downarrow \) carry a MacBook \(\uparrow \).

     (b) All students carry a laptop.

     (c) All new students carry a MacBook.

  2. (a) Not all new students \(\uparrow \) carry a laptop.

     (b) Not all students carry a laptop.

A phrase in an upward entailing context (\(\uparrow \)) licenses the inference from (1a) to (1b), where the more general concept laptop replaces the more specific MacBook. A downward entailing context (\(\downarrow \)) licenses the inference from (1a) to (1c), where the more specific phrase new students replaces students. The direction of monotonicity can be reversed by adding a downward entailing phrase like “Not”; thus, (2a) entails (2b).

In this paper, we provide an in-depth discussion of monotonicity reasoning in the age of neural foundation models, covering both methodology and analysis. First, we investigate whether combining advanced neural network mechanisms, such as attention, with structural sentence knowledge based on the linguistic principle of compositionality can achieve accurate and robust monotonicity reasoning in the form of NLI. We propose AttentiveTreeNet, which consists of a Tree-LSTM encoder with an attention mechanism and a multi-hop self-attention aggregator for NLI classification. We evaluate AttentiveTreeNet on the MED benchmark (Yanaka et al., 2019) and show that it significantly outperforms BERT (Devlin et al., 2019), a strong pretrained foundation model.

Next, we propose a symbolic reasoning system that performs monotonicity reasoning based on polarity marks and incorporates neural language models to handle syntactic variations in the data. Our proposed system, called NeuralLog, achieves state-of-the-art performance on the MED benchmark, significantly outperforming prior neural network models. The advantage of NeuralLog is its ability to perform step-by-step reasoning based on human-defined symbolic logic rules while resolving syntactic variations with neural language models, which makes its reasoning more robust and generalizable than prior logic-based reasoning systems.

In the last part, we benchmark pretrained models fine-tuned on massive task-specific training data (with parameter sizes \(\le \) 11 billion) and large-scale language models (with parameter sizes \(\ge \) 11 billion) on monotonicity reasoning through instruction-based zero-shot learning and in-context few-shot learning. Our objective is to assess whether these large foundation models can emulate logical reasoning, given their impressive performance on various linguistic tasks and applications. Our evaluation shows that current large language models still fail to perform logical reasoning well: despite instructions and few-shot examples, they achieve only random performance on the monotonicity test set from the CURRICULUM benchmark (Chen and Gao, 2022), which is a curated mixture of the MED (Yanaka et al., 2019) and Semantic Fragments (Richardson et al., 2019) datasets.

Overall, we show that although large-scale foundation models are dominating the field of natural language processing by mastering many tasks and applications, they still cannot emulate logical reasoning like monotonicity inference. Meanwhile, symbolic reasoning systems that incorporate neural language models can achieve state-of-the-art performance that is interpretable and robust. A subset of this work was previously published as Chen (2021) and Chen et al. (2021).

2 Attentive Tree Structured Network

2.1 Preliminaries

In this section, we propose a tree-structured long short-term memory (LSTM) network in which the syntactic information of a sentence is encoded and the alignment between the premise-hypothesis pair is computed through a self-attention mechanism. A standard sequential LSTM network (Wang & Jiang, 2016) only permits sequential information propagation. However, the linguistic principle of compositionality states that an expression’s meaning is derived from the meanings of its parts and the way they are syntactically combined (Partee, 2007). A tree-structured LSTM network allows each LSTM unit to incorporate information from multiple child units, taking advantage of the fact that sentences are syntactically composed bottom-up as tree structures.

2.2 Method

Tree-LSTM Encoder

The main architecture builds on the Child-Sum Tree-LSTM (Tai et al., 2015), in which the computation of a hidden state is conditioned on both the current input and the hidden states of an arbitrary number of child nodes. This property allows the relation representations of non-leaf nodes to be computed recursively by composing the relations of their children, which can be viewed as natural logic for neural models (MacCartney & Manning, 2009; Zhao et al., 2016). The computation flow in an LSTM cell is as follows:

$$\begin{aligned} {\tilde{h}}&= \sum _{1 \le k \le n}h_{k}, \\ i&= \sigma (W^{(i)}x+U^{(i)}{\tilde{h}}+b^{(i)}), \\ o&= \sigma (W^{(o)}x+U^{(o)}{\tilde{h}}+b^{(o)}), \\ u&= \tanh (W^{(u)}x+U^{(u)}{\tilde{h}}+b^{(u)}), \\ f_{k}&= \sigma (W^{(f)}x+U^{(f)}h_{k}+b^{(f)}), \\ c&= i \odot u+\sum _{1 \le k \le n}f_{k} \odot c_{k}, \\ h&= o \odot \tanh (c), \end{aligned}$$

where n is the number of children of the current node, k indexes the children, and \({\tilde{h}}\) is the sum of the hidden states of the current node’s children. The forget gate \(f_{k}\) controls how much memory is passed on from the kth child. The input gate i controls how much of the internal input u is written to memory, and the output gate o controls the degree of exposure of the memory. Here, \(\sigma \) is the sigmoid activation function, \(\odot \) is the element-wise product, and W and U are trainable weight matrices.
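The cell above can be written down compactly. The following is a minimal PyTorch sketch of a Child-Sum Tree-LSTM cell that follows these equations; the class and variable names are illustrative and are not taken from a released implementation.

```python
import torch
import torch.nn as nn


class ChildSumTreeLSTMCell(nn.Module):
    """Child-Sum Tree-LSTM cell: i, o, u are conditioned on the summed child
    hidden state h_tilde; one forget gate f_k is computed per child."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.W_iou = nn.Linear(input_dim, 3 * hidden_dim)
        self.U_iou = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        self.W_f = nn.Linear(input_dim, hidden_dim)
        self.U_f = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (input_dim,); child_h, child_c: (n_children, hidden_dim)
        h_tilde = child_h.sum(dim=0)                    # sum of child hidden states
        i, o, u = (self.W_iou(x) + self.U_iou(h_tilde)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x).unsqueeze(0) + self.U_f(child_h))  # one gate per child
        c = i * u + (f * child_c).sum(dim=0)            # gated sum of child memories
        h = o * torch.tanh(c)
        return h, c
```

A leaf node can be handled by passing empty `(0, hidden_dim)` tensors for `child_h` and `child_c`, in which case the child sums reduce to zero vectors.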

Attention Mechanism

We propose incorporating an attention mechanism (Zhou et al., 2016) into the LSTM network. Attention accounts for contextual relevance by assigning higher weights to children that are more relevant to the context. We apply a soft-attention layer, which receives a set of hidden states \(\{h_{1},h_{2},...,h_{n}\}\) and a vector representation s of the sentence computed by a sequential LSTM layer. The attention layer assigns a weight \(\alpha _{k}\) to each hidden state and computes the context vector g as a weighted sum:

$$\begin{aligned} m_{k}&= \tanh (W^{(m)}h_{k} + U^{(m)}s), \\ \alpha _{k}&= \frac{e^{w^{\top } m_{k}}}{\sum _{j=1}^{n}e^{w^{\top } m_{j}}}, \\ g&= \sum _{1 \le k \le n}\alpha _{k}h_{k}. \end{aligned}$$

The hidden state for the next cell is then computed via a transformation \({\tilde{h}} = \tanh (W^{(a)}g + b^{(a)})\).
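A matching PyTorch sketch of this soft-attention layer is given below; again, the class and parameter names are illustrative rather than taken from a released codebase.

```python
import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """Attention over child hidden states, conditioned on a sentence vector s."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.W_m = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U_m = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w = nn.Linear(hidden_dim, 1, bias=False)    # scoring vector w
        self.W_a = nn.Linear(hidden_dim, hidden_dim)     # final transformation

    def forward(self, h, s):
        # h: (n, hidden_dim) child hidden states; s: (hidden_dim,) sentence vector
        m = torch.tanh(self.W_m(h) + self.U_m(s))              # m_k
        alpha = torch.softmax(self.w(m).squeeze(-1), dim=0)    # attention weights alpha_k
        g = (alpha.unsqueeze(-1) * h).sum(dim=0)               # context vector g
        return torch.tanh(self.W_a(g))                         # h_tilde for the next cell
```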

Self-Attention Aggregator

We encode the premise and hypothesis using the attentive encoder, concatenate their hidden states into a pair of matrices \(H_p\) and \(H_h\), and pass them to a self-attentive aggregator. To aggregate, we first apply a multi-hop self-attention mechanism (Lin et al., 2017). Performing multiple hops of attention lets the model attend to different parts of a sentence, since the sentence context is formed by multiple components. Given a matrix H, we perform multiple hops of attention to compute an annotation matrix A consisting of one weight vector per hop. A is calculated from a 2-layer multi-layer perceptron (MLP) followed by a softmax: \(A = \textrm{softmax}(W_{s2}\tanh (W_{s1}H^{\top }))\). The annotation matrix is multiplied by the hidden states H to obtain a context matrix \(M = AH\). With a pair of context matrices \(M_p\) and \(M_h\), we compute the outputs as:

$$\begin{aligned} F_p = \tanh (M_p \times W_f), \;\;\;\; F_h = \tanh (M_h \times W_f). \end{aligned}$$
(1)

To aggregate \(F_p\) and \(F_h\), we follow a generic NLI training scheme (Conneau et al., 2017) and use three matching methods: (i) concatenation, (ii) absolute element-wise distance, and (iii) element-wise product. The results of the three methods are concatenated, \(F_r = [F_p; F_h; |F_p - F_h|; F_p \odot F_h]\), as a representation of the semantic relation between the two sentences. An MLP layer serves as the classifier, predicting the label from this representation.
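The aggregator and the matching step can be sketched as follows in PyTorch; dimensions and names are illustrative, and the final MLP classifier is omitted.

```python
import torch
import torch.nn as nn


class MultiHopAggregator(nn.Module):
    """Multi-hop self-attention (annotation matrix A), Eq. (1), and the three
    matching operations that produce the relation representation F_r."""

    def __init__(self, hidden_dim: int, attn_dim: int, n_hops: int, out_dim: int):
        super().__init__()
        self.W_s1 = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.W_s2 = nn.Linear(attn_dim, n_hops, bias=False)
        self.W_f = nn.Linear(hidden_dim, out_dim, bias=False)

    def attend(self, H):
        # H: (seq_len, hidden_dim) -> A: (n_hops, seq_len), M: (n_hops, hidden_dim)
        A = torch.softmax(self.W_s2(torch.tanh(self.W_s1(H))).t(), dim=-1)
        M = A @ H
        return torch.tanh(self.W_f(M))          # F in Eq. (1)

    def forward(self, H_p, H_h):
        F_p, F_h = self.attend(H_p), self.attend(H_h)
        # (i) concatenation, (ii) absolute distance, (iii) element-wise product
        F_r = torch.cat([F_p, F_h, (F_p - F_h).abs(), F_p * F_h], dim=-1)
        return F_r.flatten()                    # fed to the MLP classifier
```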

2.3 Evaluation

Datasets

We evaluate our proposed method on the Monotonicity Entailment Dataset (MED) Yanaka et al. (2019). MED is a high-quality benchmark that aims to examine models’ ability to perform monotonicity reasoning. MED covers various linguistic phenomena such as lexical knowledge, conjunction, disjunction, conditional, and negative polarity items. The dataset contains 5382 premise-hypothesis pairs, including 1820 examples for upward inference, 3270 for downward inference, and 292 neutral examples.

Setup and Baselines

Initially, we used the HELP dataset (Yanaka et al., 2019) to train our model. HELP is a dataset for learning entailment with lexical and logical phenomena; it combines lexical and logical inferences with a focus on monotonicity. Next, we trained our model with the Multi-Genre NLI Corpus (MNLI) (Williams et al., 2018), which covers a wide range of genres of spoken and written language. The majority of the training examples in MNLI are upward monotone. To provide more balanced training data, we combined a subset of MNLI with HELP to offset the large number of downward monotone examples in HELP. Due to limited computational resources at the time of training, we randomly sampled only a subset of MNLI to reduce training time. We call this combined training set HELP+SubMNLI. We removed the contradiction examples from MNLI, since neither the test dataset MED nor the training dataset HELP contains the label Contradiction.

Training

To train our model, we used Stanford’s pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize the word embeddings. The Stanford Dependency Parser (Chen & Manning, 2014) was used to parse each sentence in the dataset. The model is trained with the Adam optimizer (Kingma & Ba, 2014), which is computationally efficient and converges quickly, using the standard learning rate of 0.001. Dropout with a standard rate of 0.5 is applied to the feed-forward layer in the self-attention aggregator and to the classifier to reduce over-fitting. For the number of self-attention hops, we used the default of 15. The evaluation metric is accuracy. The system is implemented in PyTorch and trained on a T4 GPU for 20 epochs.
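For reference, the hyperparameters reported above can be gathered into a single configuration; this is only a summary sketch, and the field names are ours rather than from a released codebase.

```python
# Training configuration summarizing the reported hyperparameters.
config = {
    "word_embeddings": "glove.840B.300d",   # 300-D GloVe vectors
    "parser": "stanford-dependency-parser",
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "dropout": 0.5,          # aggregator feed-forward layer and classifier
    "attention_hops": 15,    # multi-hop self-attention
    "epochs": 20,
    "device": "cuda",        # single T4 GPU
    "metric": "accuracy",
}
```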

Table 1 Accuracy of our model and other state-of-the-art NLI models evaluated on MED

2.4 Results

2.4.1 MED Performance

Table 1 shows our method’s performance compared against common NLI methods on the Monotonicity Entailment Dataset (MED). Our model achieves an overall accuracy of 75.7% and outperforms all other models, including the pre-trained language model BERT, which previously showed state-of-the-art performance on NLI tasks. On downward monotonicity reasoning, which is more difficult than upward, our method shows a significant improvement over the baselines, scoring 4.5% higher than the BERT model. Interestingly, our model achieves better performance on downward inference even when trained with HELP or MNLI alone (compared to baselines with similar training data), which points to a structural advantage of our architecture over the baselines. On upward monotonicity reasoning, our model is only slightly behind the BERT model (by 1.3%) but still outperforms the other baselines by a large margin (10.3% over the best non-BERT baseline). Note that augmenting HELP with a subset of MNLI improves performance on upward monotone examples (+25.7%), showing that additional training on general NLI examples helps the model learn upward inference. On examples without monotonicity inference, our method does not perform as well as it does on examples with monotonicity, suggesting that while achieving high performance on monotonicity reasoning, our method loses some ability to handle general NLI problems. Overall, our attentive tree-based network achieves the highest performance among the compared models on monotonicity reasoning.

2.4.2 Ablation Test

To further analyze each component’s contribution to performance on monotonicity reasoning, we conduct several ablation tests. We first ablate the self-attentive aggregator by building the feature vector for classification directly from the Tree-LSTM encoder output. As Table 2 (–Self-attention) shows, the model trained on HELP+SubMNLI suffers a significant overall performance drop (6.6%), with a 76% drop on downward inference and a 10.9% drop on upward inference. This drop suggests that the self-attentive aggregator is an important component for monotonicity reasoning. For the second ablation test, we replace the Tree-LSTM encoder with a standard sequential LSTM encoder. This results in a larger drop on upward inference (26.7%) and a 14.1% drop on downward inference, demonstrating that removing the Tree-LSTM has a significant negative impact on the model’s monotonicity reasoning ability; thus, the Tree-LSTM is also a major component of our proposed model. Overall, removing the Tree-LSTM encoder affects the model’s performance the most, so we conclude that the Tree-LSTM encoder contributes most to the model’s performance on monotonicity reasoning.

Table 2 Accuracy of the ablation tests trained on HELP and HELP+SubMNLI and tested on MED. Two ablations were performed: (i) removing the self-attentive aggregator (–Self-attention), and (ii) replacing the Tree-LSTM with a regular sequential LSTM (–Tree-LSTM)

3 Neural-Symbolic Reasoning

3.1 Preliminary

Evaluation results for the Attentive Tree Network show that providing and enhancing structural knowledge of sentences is an effective way to improve neural models’ monotonicity reasoning ability. However, directly embedding symbolic logical information into a neural model is difficult. A better approach is to build a symbolic reasoning system that incorporates neural modules into its inference process for better and more robust reasoning performance. Several symbolic reasoning systems for NLI have previously been proposed (Abzianidze, 2017; Martínez-Gómez et al., 2017; Yanaka et al., 2018; Hu et al., 2020) that solve the NLI task using symbolic rules and semantic formalisms. These systems show high precision on complex inferences involving difficult linguistic phenomena and present logical, explainable reasoning processes. However, they also have several limitations, such as a lack of background knowledge and an inability to handle sentences with syntactic variations. On the other hand, new pre-trained language models are becoming more robust and accurate through improved pre-training objectives and data, enabling them to handle diverse and large test sets robustly. However, several experiments show that deep learning models lack generalization ability, adopt fallible syntactic heuristics, and exploit annotation artifacts (Glockner et al., 2018; McCoy et al., 2019; Gururangan et al., 2018). We propose combining the strengths of these two types of systems into a hybrid reasoning system that can perform monotonicity reasoning.

3.2 Method

Our system contains three components: (1) a polarity annotator, (2) three sentence inference modules, and (3) a search engine. Figure 1 shows a diagram of the full system.

3.2.1 Polarity Annotator

To perform robust and accurate monotonicity reasoning, the system needs annotations of monotonicity information on the given premises. To obtain them, we utilize Udep2Mono (Chen and Gao, 2021), a polarity annotator that determines the monotonicity polarity of all constituents in a universal dependency tree. The annotator first parses the premise into a binarized universal dependency tree and then performs polarization by recursively marking the polarity of each tree node. The polarity marks are monotone (\(\uparrow \)), antitone (\(\downarrow \)), and no monotonicity information (=). An annotated example is Every\(^\uparrow \) healthy\(^\downarrow \) person\(^\downarrow \) plays\(^\uparrow \) sports\(^\uparrow \), where monotone tokens are tagged with \(^\uparrow \) and antitone tokens with \(^\downarrow \).
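The simplified sketch below illustrates the idea of recursive polarity marking on a binarized tree for the running example. It is not the Udep2Mono implementation or its API: it covers only a few operators, and the tree encoding is our own.

```python
FLIP = {"up": "down", "down": "up"}

# Argument polarities (restrictor, body) of a few quantifiers.
QUANTIFIERS = {"every": ("down", "up"), "all": ("down", "up"),
               "some": ("up", "up"), "no": ("down", "down")}

def compose(outer, inner):
    # a downward outer context reverses the polarity assigned to an argument
    return inner if outer == "up" else FLIP[inner]

def polarize(node, polarity="up"):
    """node is a token string or a (head, *children) tuple; returns the node
    with a polarity mark attached to every token."""
    if isinstance(node, str):
        return (node, polarity)
    head, *children = node
    if head in QUANTIFIERS:
        restr, body = QUANTIFIERS[head]
        return ((head, polarity),
                polarize(children[0], compose(polarity, restr)),
                polarize(children[1], compose(polarity, body)))
    if head == "not":
        return ((head, polarity), polarize(children[0], FLIP[polarity]))
    # default: children inherit their parent's polarity
    return ((head, polarity), *(polarize(c, polarity) for c in children))

# "Every healthy person plays sports"
tree = ("every", ("person", "healthy"), ("plays", "sports"))
print(polarize(tree))
# (('every', 'up'), (('person', 'down'), ('healthy', 'down')),
#  (('plays', 'up'), ('sports', 'up')))
```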

3.2.2 Search Engine

Next, the polarized parse tree is passed to the search engine. A beam search algorithm searches for the optimal inference path from a premise to a hypothesis. During an inference step, we rank the generated sentences with a distance function and select the sentence with the minimum distance to proceed:

$$\begin{aligned} \mathbf {\textrm{s}}^\star = \mathop {\mathrm {arg\,min}}\limits _{\textrm{s} \in {\mathcal {S}}} \textrm{dist}(\textrm{s},\textrm{H}), \end{aligned}$$
(2)

where \(\textrm{H}\) is the hypothesis, \({\mathcal {S}}\) is the set of intermediate premises generated by the three inference modules, and \(\mathbf {\textrm{s}}^\star \) is the intermediate premise with the minimal distance to the hypothesis, from which the search continues. We formulate the distance function as the Euclidean distance between the sentence embeddings of an intermediate premise and the hypothesis. The search space is generated by three inference modules: lexical, phrasal, and syntactic variation. In practice, we expand the search space from the top-k intermediate premises rather than only the optimal one. The system returns Entail if an inference path is found. Otherwise, the premise and hypothesis are categorized as Non-Entail, and the controller further searches for counter-example signatures to differentiate Contradict from Neutral. In this paper, we only analyze the system’s performance on the MED dataset (2-way classification: Entail and Non-Entail) and hence omit the details of how the system detects contradiction signatures.
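A minimal sketch of the beam search is shown below. Here `generate_inferences` stands in for the three inference modules, the Sentence-Transformers checkpoint is an assumed stand-in for the sentence encoder used in the distance function, and the stopping threshold `eps` and depth limit are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def dist(sentence: str, hypothesis: str) -> float:
    a, b = encoder.encode([sentence, hypothesis])
    return float(np.linalg.norm(a - b))            # Euclidean distance, Eq. (2)

def beam_search(premise, hypothesis, generate_inferences, k=5, max_depth=4, eps=0.1):
    beam = [premise]
    for _ in range(max_depth):
        # expand every premise in the beam with the three inference modules
        candidates = {s for p in beam for s in generate_inferences(p, hypothesis)}
        if not candidates:
            break
        ranked = sorted(candidates, key=lambda s: dist(s, hypothesis))
        if ranked[0] == hypothesis or dist(ranked[0], hypothesis) < eps:
            return "Entail"                        # an inference path was found
        beam = ranked[:k]                          # keep the top-k intermediate premises
    return "Non-Entail"
```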

Fig. 1 Overview system diagram of NeuralLog, including (1) the polarity annotator, (2) the three inference modules, and (3) the beam search engine

3.2.3 Inference Generation

Lexical Monotonicity Inference

The lexical inference module performs word replacement on key tokens, including nouns, verbs, numbers, and quantifiers, based on monotonicity information. The system uses lexical knowledge bases, including WordNet (Miller, 1995) and ConceptNet (Liu and Singh, 2004). From the knowledge bases, we extract four sets of words: hypernyms, hyponyms, synonyms, and antonyms. Logically, if a word has a monotone polarity (\(\uparrow \)), it can be replaced by its hypernyms. For example, swim \(\le \) move, so swim can be replaced with move, where \(\le \) means that the left-hand-side word is a type of the right-hand-side word. If a word has an antitone polarity (\(\downarrow \)), it can be replaced by its hyponyms. For example, flower \(\ge \) rose, so flower can be replaced with rose, where \(\ge \) means that the right-hand-side word is a type of the left-hand-side word. We filter out words from the knowledge bases that do not appear in the hypothesis. Additionally, we handcraft knowledge relations for words such as quantifiers and prepositions that lack sufficient taxonomies in the knowledge bases. Some handcrafted relations that hold in general include: all = every = each \(\le \) most \(\le \) many \(\le \) several \(\le \) some = a, and up \(\perp \) down, where \(=\) means that the two words are equivalent and \(\perp \) marks mutually exclusive words.
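The sketch below illustrates polarity-driven lexical replacement using WordNet through NLTK. The function and quantifier scale are simplifications of the module described above, and only one level of hypernyms or hyponyms is retrieved, so chained replacements may be needed in practice.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

# handcrafted scale: all = every = each <= most <= many <= several <= some = a
QUANTIFIER_SCALE = ["all", "most", "many", "several", "some"]

def replacements(word: str, polarity: str):
    """Candidate replacements licensed by a token's polarity mark."""
    candidates = set()
    for synset in wn.synsets(word):
        # upward (monotone) positions license hypernyms, downward ones hyponyms
        related = synset.hypernyms() if polarity == "up" else synset.hyponyms()
        for s in related:
            candidates.update(l.name().replace("_", " ") for l in s.lemmas())
        # synonyms are licensed in either direction
        candidates.update(l.name().replace("_", " ") for l in synset.lemmas())
    candidates.discard(word)
    return candidates

# swim <= move: 'move' typically appears among the hypernym lemmas of 'swim'
print("move" in replacements("swim", "up"))
```

In the full system, the candidate set is further filtered so that only words appearing in the hypothesis are kept.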

Phrasal Monotonicity Inference

Phrasal replacements handle phrase-level monotonicity inference. For example, given the polarized sentence A\(^\uparrow \) woman\(^\uparrow \) who\(^\uparrow \) is\(^\uparrow \) beautiful\(^\uparrow \) is\(^\uparrow \) walking\(^\uparrow \) in\(^\uparrow \) the\(^\uparrow \) rain\(^=\), the monotone mark \(^\uparrow \) on woman allows an upward inference, woman \(\sqsupseteq \) woman who is beautiful, in which the relative clause who is beautiful is deleted. The system follows a set of phrasal monotonicity inference rules: for upward monotonicity inference, modifiers of a word are deleted; for downward monotonicity inference, modifiers are attached to a word. The algorithm traverses a polarized UD parse tree, deletes the modifier sub-tree if a node is monotone (\(\uparrow \)), and inserts a new sub-tree if a node is antitone (\(\downarrow \)). To insert new modifiers, the algorithm extracts a list of potential modifiers associated with a node from a modifier dictionary. The modifier dictionary is derived from the hypothesis and contains word-modifier pairs for each dependency relation. Below is an example of a modifier dictionary derived from There are no beautiful flowers that open at night (a sketch of how such a dictionary licenses insertions follows the list):

  • Obl: [head: open, mod: at night]

  • Amod: [head: flowers, mod: beautiful]

  • Acl:relcl: [head: flowers, mod: that open at night]
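The sketch below shows how such a modifier dictionary can be represented and how it licenses downward insertions for an antitone token; the dictionary is hand-written for the example hypothesis rather than produced by a parser, and the word-order rule is a simplification.

```python
# Modifier dictionary for "There are no beautiful flowers that open at night".
modifier_dict = {
    "obl":       [("open", "at night")],
    "amod":      [("flowers", "beautiful")],
    "acl:relcl": [("flowers", "that open at night")],
}

def insert_modifiers(token: str, polarity: str, modifiers: dict):
    """For an antitone (downward) token, list the modified phrases that may replace it."""
    if polarity != "down":
        return []
    phrases = []
    for relation, pairs in modifiers.items():
        for head, mod in pairs:
            if head == token:
                # adjectival modifiers precede the head; clausal/oblique ones follow it
                phrase = f"{mod} {token}" if relation == "amod" else f"{token} {mod}"
                phrases.append(phrase)
    return phrases

print(insert_modifiers("flowers", "down", modifier_dict))
# ['beautiful flowers', 'flowers that open at night']
```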

Syntactic Variation Inference

We categorize linguistic changes between a premise and a hypothesis that cannot be inferred from monotonicity information as syntactic variations. For example, the change from red rose to a rose which is red is a syntactic variation. Many logical systems rely on handcrafted rules and manual transformations to handle syntactic variations. However, without accurate alignments between the two sentences, these methods are not robust and are thus difficult to scale to wide-coverage input. The recent development of pretrained transformer-based language models has brought state-of-the-art performance on multiple benchmarks for Natural Language Understanding (NLU), including paraphrase detection (Devlin et al., 2019; Lan et al., 2020; Liu et al., 2020), which exemplifies phrasal knowledge of syntactic variation. We propose a method that incorporates transformer-based language models to handle syntactic variations robustly. Our method first decomposes both the premise and the hypothesis into phrase chunks using a sentence chunker and then calculates the likelihood that each pair of chunks is a paraphrase using a transformer model.

Sequence Chunking

To obtain phrase-level chunks from a sentence, we build a sequence chunker that relies on the sentence’s universal dependency information. Instead of breaking a sentence down, our chunker composes word tokens recursively to form meaningful chunks. First, we construct a sentence representation graph of a premise from the controller. A sentence representation graph is defined as \(\textrm{G} = \langle {\mathcal {V}}, {\mathcal {E}} \rangle \), where \({\mathcal {V}} = {\mathcal {V}}_{m} \cup {\mathcal {V}}_{c}\) is the set of modifiers (\({\mathcal {V}}_{m}\)) and content words (\({\mathcal {V}}_{c}\)), and \({\mathcal {E}}\) is the set of directed edges. To generate the chunk for a content word in \({\mathcal {V}}_{c}\), we arrange its modifiers (the nodes it points to) together with the content word according to their word order in the original sentence to form a word chain. For example, in The woman in a pink dress is dancing, edges are drawn from dress to in, a, and pink, and from woman to dress. The chunks in a pink dress and the woman in a pink dress are then generated for dress and woman, respectively.
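A small sketch of this chunk composition is shown below, using a hand-written edge map for the example sentence; in the actual system the edges come from the universal dependency parse, and tokens are tracked by index rather than by surface string.

```python
sentence = "The woman in a pink dress is dancing".split()
edges = {                       # content word -> its modifiers (hand-written here)
    "dress": ["in", "a", "pink"],
    "woman": ["The", "dress"],
}

def collect(head, edges):
    """Gather a head word and, recursively, all of its modifiers."""
    words = {head}
    for mod in edges.get(head, []):
        words |= collect(mod, edges)
    return words

def chunk(head, edges, sentence):
    """Arrange the collected words by their order in the original sentence."""
    words = collect(head, edges)
    return " ".join(w for w in sentence if w in words)

print(chunk("dress", edges, sentence))   # 'in a pink dress'
print(chunk("woman", edges, sentence))   # 'The woman in a pink dress'
```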

Monolingual Phrase Alignment

Given a set of chunks from a generated sentence and from the hypothesis, the system computes an alignment score for each pair of chunks to select the syntactic variations. Formally, we define \({\mathcal {C}}_s\) as the set of chunks from a generated sentence and \({\mathcal {C}}_h\) as the set of chunks from the hypothesis. We build the Cartesian product from \({\mathcal {C}}_s\) and \({\mathcal {C}}_h\), denoted \({\mathcal {C}}_s \times {\mathcal {C}}_h\). For each chunk pair (\(\textrm{c}_{s}\), \(\textrm{c}_{h}) \in {\mathcal {C}}_s \times {\mathcal {C}}_h\), we compute an alignment score \(\varvec{\alpha }\):

$$\begin{aligned} \varvec{\alpha }_{\langle \mathbf {c_{s}}, \mathbf {c_{h}} \rangle }&= \textrm{p}(\textbf{y}\mid \langle \mathbf {c_{s}}, \mathbf {c_{h}} \rangle ) \end{aligned}$$

where \(\textbf{y}\mid {\langle \mathbf {c_{s}}, \mathbf {c_{h}} \rangle } = \textit{Softmax} ( \textrm{ALBERT} (\langle \mathbf {c_{s}}, \mathbf {c_{h}} \rangle ))\). If \(\varvec{\alpha } > 0.85\) (a threshold determined by a grid search over 5 values), the system records this pair of phrases as a syntactic variation. To calculate the alignment score, we use an ALBERT model (Lan et al., 2020) fine-tuned on the Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005). We first pass a chunk pair to ALBERT to obtain its logits and then apply a softmax function to the logits to obtain the final probability.
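The alignment scoring can be sketched with a paraphrase classifier from Hugging Face Transformers as below; the checkpoint name and the index of the paraphrase label are assumptions, not the exact model used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NAME = "textattack/albert-base-v2-MRPC"   # an ALBERT model fine-tuned on MRPC (assumed)
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME)

def alignment_score(chunk_s: str, chunk_h: str) -> float:
    """Probability that two chunks are paraphrases (label index 1 assumed)."""
    inputs = tokenizer(chunk_s, chunk_h, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# record a chunk pair as a syntactic variation if the score clears the threshold
pairs = [("red rose", "a rose which is red")]
variations = [(s, h) for s, h in pairs if alignment_score(s, h) > 0.85]
```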

3.3 Evaluation

3.3.1 Experiment Setup

For Universal Dependency parsing, we follow Udep2Mono’s framework (Chen and Gao, 2021) and use a neural parsing model from Stanford’s Stanza (Qi et al., 2020) with a 90.0 LAS score (Zeman et al., 2018). We select the BERT-large model pre-trained on STS-B (Cer et al., 2017) from Sentence-BERT (Reimers and Gurevych, 2019). For ALBERT, we use an ALBERT-base model pretrained on the MRPC corpus. We evaluate our proposed reasoning system, NeuralLog, on the MED dataset for monotonicity reasoning and compare it with multiple deep-learning-based baselines. Here, DeComp and ESIM are trained on SNLI, and BERT is fine-tuned on MultiNLI. The BERT+ model is a BERT model fine-tuned on training data that combines the HELP dataset (Yanaka et al., 2019), a set of augmentations for monotonicity reasoning, with the MultiNLI training set. Both models were tested in Yanaka et al. (2019). We also compare against the Attentive Tree Net proposed in the first part to see whether neural-symbolic inference is a better choice than a dedicated neural architecture and training data.

3.3.2 Results

As Table 3 shows, our system (NeuralLog) outperforms all the neural model baselines in accuracy by a significant margin (a 48.7% maximum and 21.8% minimum increase). Compared to the prior augmented baseline BERT+, our system performs much better on both upward (15.4%) and downward (23.6%) inference. Compared to the Attentive Tree-structured Net for monotonicity reasoning, our neural-symbolic system still shows better performance by a significant margin (\(\Delta \) 17.7%). This result highlights that using dedicated training data and neural architectures for monotonicity reasoning is not as effective as a neural-symbolic system that utilizes neural modules for intermediate reasoning. The strong performance on MED validates our system’s ability to perform accurate and robust monotonicity-based inference.

Table 3 Accuracy of our model compared to state-of-the-art NLI models evaluated on MED. Up, Down, and All denote the accuracy on upward inference, downward inference, and the overall dataset

4 Large-scale Foundation Model

4.1 Preliminary

In the field of natural language processing (NLP), the use of large language models (LLMs) has significantly changed how people approach reasoning and inference over language. It has been established that the effectiveness and efficiency of these models on various NLP applications can be improved by increasing their scale, for example by increasing their training resources and the number of model parameters (Wei et al., 2022). Self-supervised pre-training gives large-scale language models the ability to learn downstream tasks given no examples, or only a few input–output paired examples, without any parameter updates. Recent research shows growing interest in uncovering the mechanisms behind this capacity of LLMs through empirical and theoretical approaches (Rubin et al., 2021; Xie et al., 2021; Min et al., 2022; Ye and Durrett, 2022). However, current analyses of these LLMs still cannot answer whether their unpredictable emergent abilities allow them to simulate symbolic logic expressed in natural language. In this section, we benchmark various LLMs’ reasoning ability on monotonicity to gain insight into the limitations of current LLMs.

4.2 Method

Zero-Shot Learning

Many studies show that large-scale language models exhibit zero-shot learning ability (Kojima et al., 2022): the models can solve various NLP tasks simply by conditioning on instructions describing the task. We start our experiments on monotonicity reasoning in the zero-shot learning setting. Specifically, we give the model a prompt (Liu et al., 2021) in the format Instruction: \(\langle \textrm{Instruction}\rangle \;\) Context: \( \langle \textrm{Context}\rangle \;\) Question: \(\langle \textrm{Question}\rangle \;\) Answer: \(\langle \textrm{Answer}\rangle \). The model then generates the \(\langle \textrm{Answer}\rangle \) tokens for the given problem by conditioning on this prompt. In zero-shot learning, the model cannot rely on any demonstrations, only on the parametric knowledge acquired during pre-training and triggered by the prompt.
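The prompt format above can be assembled as follows; the instruction wording is illustrative rather than the exact template used in our experiments.

```python
def build_zero_shot_prompt(premise: str, hypothesis: str) -> str:
    instruction = ("Decide whether the hypothesis can be inferred from the context. "
                   "Answer with 'entailment' or 'non-entailment'.")
    return (f"Instruction: {instruction}\n"
            f"Context: {premise}\n"
            f"Question: Does the context entail the following statement? {hypothesis}\n"
            f"Answer:")

print(build_zero_shot_prompt("An Irishman won the Nobel Prize for literature.",
                             "An Irishman won the Nobel Prize."))
```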

In-Context Learning

In-context learning for large language models is formulated as a text-generation problem. The generation is conditioned on a given prompt \(\textrm{p}\) which consists of the input problem x and k examples of input–output pairs:

$$\begin{aligned} p_{\textrm{LLM}}(\textrm{y} \mid \textrm{p}) = \prod _{t=1}^T \;p(\mathrm {y_{t}} \mid \textrm{p}, \mathrm {y_{<t}})\textit{,} \end{aligned}$$
(3)

where the prompt p contains the demonstrations and the question to be answered, \(\textrm{p} = \{x_1, y_1,..., x_k, y_k, x\}\), and LLM is a large language model that generates text auto-regressively. According to Xie et al. (2021), the in-context learning ability of LLMs can be interpreted as implicit Bayesian inference acquired from the auto-regressive next-token prediction task during pre-training. The given input text, the prompt \(\textrm{p}\), provides evidence for a posterior distribution over task-related latent concepts \(\textrm{c}\), from which the corresponding label \(\textrm{y}\) is inferred:

$$\begin{aligned} p(\textrm{y} \mid \textrm{p}) = \int _{c} p(\textrm{y} \mid \textrm{c}, \textrm{p})\, p(\textrm{c} \mid \textrm{p})\, d\textrm{c} \textit{.} \end{aligned}$$
(4)

In-context learning allows one to adapt LLMs to different domains and downstream tasks without any fine-tuning. Because of its effectiveness and efficiency, we use it to investigate LLMs’ monotonicity reasoning capacity.
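The few-shot prompt of Eq. (3) is assembled in the same way, prepending the k demonstrations to the test problem; the commented-out completion call follows the pre-1.0 OpenAI Python API and is an assumption about the evaluation harness, not a specification of it.

```python
# import openai  # requires openai<1.0 and openai.api_key to be set

def build_few_shot_prompt(examples, premise, hypothesis):
    """examples: list of (premise, hypothesis, label) demonstrations."""
    def block(p, h, y=None):
        answer = f"Answer: {y}" if y else "Answer:"
        return (f"Context: {p}\n"
                f"Question: Does the context entail the following statement? {h}\n"
                f"{answer}")
    demos = "\n\n".join(block(p, h, y) for p, h, y in examples)
    return demos + "\n\n" + block(premise, hypothesis)

prompt = build_few_shot_prompt(
    [("All students carry a MacBook.", "All students carry a laptop.", "entailment")],
    "Not all new students carry a laptop.",
    "Not all students carry a laptop.",
)
# response = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=5)
```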

4.3 Evaluation

Setup

The evaluation focuses on assessing large language models’ reasoning ability with respect to monotonicity. We evaluate both the zero-shot learning setting and the few-shot in-context learning setting. When designing the prompt, we follow previous work on prompt-based multi-task learning (Sanh et al., 2021) and build a Natural-Language-Inference-styled prompt. We include detailed instructions about the task and its label space to inject domain-specific understanding into the models. We use the monotonicity reasoning test set from the CURRICULUM benchmark (Chen and Gao, 2022), a large-scale reasoning benchmark for evaluating broad-coverage linguistic phenomena. The monotonicity portion of CURRICULUM integrates the MED dataset, the Semantic Fragments test sets (Richardson et al., 2019), and 500 additional monotonicity reasoning sentence pairs manually annotated and curated by human writers. Overall, this test set provides high-quality data and challenging problems and supports analysis of powerful contextualized language models, allowing us to conduct a more in-depth evaluation of modern large-scale language models. For in-context learning, we provide 4, 8, and 16 in-context examples to the LLMs, respectively. For instance, in the 4-shot setting, 4 examples are randomly sampled from the training set for each label and concatenated to the prompt as a prefix. Each setting is evaluated 3 times, with the in-context examples fixed within each round to reduce potential bias from example selection, and we report the average performance across the 3 runs.

Baselines

For model selection, we pick LLMs with strong zero-shot learning abilities. The first type of model we select is LLMs that are further fine-tuned in multi-task or instruction-tuning settings. We first report the baseline performance of current state-of-the-art NLI models, RoBERTa (Liu et al., 2019) and DeBERTa (He et al., 2021). These two models are pre-trained bidirectional transformer language models that have shown impressive performance on NLI; they are fine-tuned on a mixture of common NLI training sets, including SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), FEVER (Thorne et al., 2018), and ANLI (Nie et al., 2020). We select FLAN-T5 (Wei et al., 2021) as the instruction-tuned model. FLAN-T5 is a T5 text-to-text model trained with an instruction-based fine-tuning procedure on a collection of data sources with various instruction template types. FLAN with scaled parameters and training instructions shows strong zero-shot and few-shot learning abilities, outperforming prior public checkpoints. The second type is LLMs with very large parameter counts (175 billion) that show a remarkable ability for in-context learning (Chung et al., 2022). We select the popular GPT-3 models from OpenAI. GPT-3 (Brown et al., 2020) is a state-of-the-art auto-regressive language generation model. With 175 billion parameters and massive pre-training text data, it is currently one of the largest and most powerful language models in existence, capable of a wide range of natural language processing tasks. We include the original pre-trained GPT-3 model (text-davinci-001) and the GPT-3.5 (text-davinci-003) model (Ouyang et al., 2022), a version of InstructGPT fine-tuned using reinforcement learning with reward models trained from human feedback (RLHF). GPT-3.5 is much better at following the human intent in the instruction than the pre-trained version (Ouyang et al., 2022).

4.4 Results

Table 4 shows the evaluation results for these models. Both GPT-3 and GPT-3.5 achieve only random performance (50%). GPT-3.5 outperforms GPT-3 by about 6% in every setting, but its performance is still far from proficiency in monotonicity reasoning. These low scores raise the question of whether LLMs can emulate logical reasoning expressed in natural language. Interestingly, instruction tuning with RLHF (Ouyang et al., 2022) does not substantially improve the model’s understanding of logical inference, as shown by the overall low accuracy of GPT-3.5. On the other hand, compared to GPT-3, whose performance appears insensitive to the number of in-context examples, GPT-3.5 shows consistent improvements as we give the model more examples, although the increases are still marginal: GPT-3.5’s performance gain from in-context learning is only 0.4%. The results indicate that the in-context examples give the model a certain level of domain-specific task knowledge but fail to teach it to perform monotonicity reasoning. GPT-3’s poor performance in the zero-shot setting and its fluctuation across in-context learning settings suggest that the model only learns shallow structural knowledge about the task rather than the implicit reasoning skill. Regarding smaller instruction-tuned models, FLAN-T5 outperforms GPT-3 and is comparable to GPT-3.5 in the 16-shot setting. However, its performance is still near random, suggesting that it understands the task better thanks to its many instruction-fine-tuning tasks but still fails to learn the logical reasoning rules from the instructions. For the smaller models, both RoBERTa and DeBERTa show near-random performance. Their lack of knowledge of monotonicity reasoning is expected, as Chen and Gao (2022) showed that pretrained transformer-based models may not encode much monotonicity information during pre-training. Nevertheless, even fine-tuning with commonly used NLI training data still fails to improve the models’ performance on monotonicity reasoning. Such results raise concerns about the learning quality of these models and about the lack of logical reasoning examples in common NLI datasets. Overall, we show that large language models still require major effort to improve their logical reasoning ability.

Table 4 Evaluation results for large language models (LLMs) on CURRICULUM’s monotonicity test set. Here M refers to million and B refers to billion for the number of parameters

5 Conclusions

In this paper, we provide an in-depth discussion of monotonicity reasoning in the age of neural foundation models. To summarize, we first propose AttentiveTreeNet to investigate the effectiveness of incorporating structural knowledge and linguistic principles into neural architectures for monotonicity reasoning. Next, we propose a hybrid reasoning framework that utilizes both symbolic reasoning modules built from human-defined logical rules and neural language models to solve monotonicity NLI problems. In the third part, we analyze several popular and powerful large foundation models on monotonicity reasoning to verify whether the ability to emulate logical reasoning has emerged in these massive neural models. Our evaluation focuses on the MED benchmark and the CURRICULUM benchmark’s monotonicity section. Our analysis shows that injecting structural knowledge into advanced neural networks can largely improve the original network’s performance on monotonicity inference. Furthermore, performing reasoning jointly with symbolic and neural modules can master the monotonicity reasoning task even more completely and achieve state-of-the-art performance while maintaining high interpretability. We also show that large language models are far from mastering the skill of logical reasoning: although popular models like InstructGPT can produce powerful generations and predictions for various linguistic tasks and applications, they achieve only random performance on monotonicity reasoning. Overall, our work reveals the limitations of current large foundation models and sheds light on a new direction for approaching logical reasoning through neural-symbolic inference. For future work, it would be exciting to see symbolic reasoning systems built on top of large language models for complex logical reasoning tasks.