1 Introduction

Question answering and inference using natural language is a classic A.I. area with a long history of little success: symbolic methods were able to solve only small problems with a limited structure. Recent machine learning (ML) systems, in particular the Large Language Model (LLM) implementations of the BERT and GPT families, are, in contrast, often able to give satisfactory answers to nontrivial questions.

However, the current LLMs are neither trustworthy nor explainable. They have a well-known tendency to “hallucinate”, i.e. to give wrong answers and to invent nonexistent entities and facts. Explicitly controlling the output and giving explanations for the solutions appear to be very hard for LLMs. An optimistic view of LLMs suggests that end-to-end learning can be improved to overcome these issues, while a more pessimistic view holds that the problems are inherent and stem from the lack of an internal world model. The proponents of the latter view propose building hybrid neurosymbolic systems, combining machine learning and symbolic methods of various kinds. Indeed, research in the field of neurosymbolic systems has become quite active. The recent survey [14] points to a wider interest in connecting natural language systems to external software like databases and scientific calculations.

Using logic for natural language inference (NLI) in combination with ML may potentially alleviate the problems with LLMs and provide a glue to connect external systems to natural language interfaces. However, using logic directly for processing natural language is hard, for a number of reasons:

  • Semantic parsing, i.e. translating natural language to logic, is extremely hard due to the highly complex and exception-rich nature of natural language.

  • Existing knowledge bases of “common sense” do not cover a critical mass of the basic understanding of the world even a small child possesses.

  • Classical first order reasoning itself cannot cope with contradictory knowledge items, probabilistic or uncertain information and exceptions to rules.

  • Finding logic-based answers often requires long proofs, and a huge knowledge base quickly causes a combinatorial explosion of the search space.

The motivation behind the research described in this paper is the following hypothesis: all the main problems described above can be alleviated by using ML techniques tailored separately to each particular problem. The current paper does not introduce any ML techniques for the problems above; rather, the goal of our system is to serve as a backbone for research into combining symbolic methods with ML. Our hypothesis is that by gradually improving and combining the existing symbolic subsystems with ML techniques it is possible to eventually build a question answering system with enough power, trustworthiness and explainability to be practically useful in various application areas.

In other words, the envisioned end goal of this research is neither to replace LLMs nor to verify their output, but to develop systems combining LLMs and symbolic reasoning for specific areas where it is feasible to build sets of domain-specific rules and factual databases.

2 Related Work

Here we only consider projects building a full NLP inference system. The performance of older purely symbolic or logic-based methods like LogAnswer [7] remained at the level of specific toy examples and never achieved the capabilities required for wider applicability. The long-running CYC project [22], although it had several successes, did not achieve its original stated goals, which is often used as an argument against symbolic systems.

A popular area for language processing is converting human queries to SQL or SPARQL queries. These systems typically do not handle rules expressed in natural language. The projects closest to ours use reasoners with a relatively limited capacity: BRAID [12], which uses an extended SLD+ reasoner with probabilistic rules and fuzzy unification; CASPR [18], which uses an ASP reasoner incorporating default logic; and NatPro [1, 2], which uses a Natural Logic prover. The latter is the only such project we know to be publicly available: https://github.com/kovvalsky/prove_SICK_NL.

The majority of research in neurosymbolic reasoning for natural language combines ML with weak forms of symbolic systems, typically taxonomies and triple graph knowledge bases like ConceptNet [25]. We approach the problem from the less common direction: starting from the symbolic/reasoning side and moving towards ML. There are already a few research projects combining ML with reasoning in quantified first order logic, although we are not aware of any such systems being publicly available. Noteworthy projects involving quantified logic are SQuARE [4], BRAID [12] and STAR [21]. Recent work using large language models (LLMs) to map informal proofs to formal Isabelle [17] proof sketches guiding an automated prover [34], and using LLMs directly to generate Isabelle code [11], shows clear promise in combining LLMs with provers.

3 Natural Language Inference and Question Answering

The described pipeline is able to handle both the natural language inference (NLI) tasks (given a premise, determine whether a given hypothesis is true, false or indeterminate) and the closely related question answering tasks of finding a specific object matching a given criterion.

We will use a few simple examples throughout the paper. The expected answer to the first example “If an animal likes honey, then it is probably a bear. Most bears are big, although young bears are not big. John is an animal who likes honey. Mike is a young bear. Who is big?” is “Likely John”. The expected answer to the second example “The length of the red car is 4 m. The length of the black car is 5 m. The length of the red car is less than 5 m?” is “True”.

It is worth noting that these examples are solved correctly by the current (May 2023) versions of GPT: ChatGPT using the text-davinci-002 model and the API using the gpt-3.5-turbo and gpt-4 models; moreover, they are able to give a satisfactory explanation of the reasoning behind the answers. However, if we insert additional irrelevant information into the first example, our system still finds the expected answer, while none of the GPT models above gives a correct answer: “If an animal likes honey, then it is probably a bear. Most bears are big, although young bears are not big. John is an animal who likes honey. Mike is a young bear. Mike can eat a lot. Penguins are birds who cannot fly. John took the block from the colored table. The table was really nice. The robot arm lifted a blue block from the table. Who is big?”.

Similarly, when we modify the second example by using meaningless words and adding irrelevant text, our system finds the expected answer, while all of the GPT models referred to above give confusing answers: “The length of the barner is 200000000 m. The length of the red foozer is 312435 m. Most barners are 1000000 m long. Sun is larger than the moon. John saw the sun rising over an enormous foozer. A huge robot filled the sky. The length of the red foozer is less than 312546 m?” However, the answers given by GPT versions may vary over time, i.e. experiments with GPT are not reproducible.

4 The Question Answering Pipeline

Our system is publicly available at http://github.com/tammet/nlpsolver. It requires Linux and should be easy to install. The implementation consists of four main software systems. The pipeline driver calls the external Stanza parser [20] from Stanford, which produces a Universal Dependencies (UD, see [5]) graph, then runs the semantic parser on the UD graph, calls the reasoner, and finally builds a natural language answer along with the explanation constructed from the proofs given by the reasoner. The pipeline driver, parser and answer construction components consist of over 400 Kbytes of Python code. Before running the solver, a small Python server component has to be started to initialize the external UD parser Stanza and read a commonsense knowledge base into shared memory. For reasoning the pipeline calls our commonsense reasoner GK, written in C: this is the largest and the most complex part of the pipeline. There is a separate Python program for regression tests, along with several Python files containing sub-tests, currently over 1600 separate NLI tasks. The pipeline driver is called from a command line, with a natural language text and question as command line arguments, plus a number of optional arguments to control behaviors like the amount of output.
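To make the flow concrete, the following is a minimal sketch of the four pipeline stages. Every function body below is a hypothetical stub for illustration, not code from the repository:

```python
"""Minimal sketch of the four-stage pipeline; every function body is a
hypothetical stub, not the actual code from the nlpsolver repository."""
import json
import subprocess

def parse_to_ud(text: str) -> dict:
    # Stage 1: ask the Python server wrapping Stanza for a UD graph.
    return {"sentences": []}                                   # stub

def ud_to_logic(ud: dict) -> list:
    # Stage 2: semantic parsing, UD graph -> clausified JSON-LD-LOGIC.
    return []                                                  # stub

def run_gk(clauses: list) -> dict:
    # Stage 3: write the problem to a file and call the GK binary (C).
    with open("problem.json", "w") as f:
        json.dump(clauses, f)
    out = subprocess.run(["gk", "problem.json"],
                         capture_output=True, text=True)
    return json.loads(out.stdout or "{}")

def build_answer(proof: dict) -> str:
    # Stage 4: construct a natural language answer plus explanation.
    return "Unknown" if not proof else str(proof)              # stub

def answer(text: str, question: str) -> str:
    return build_answer(run_gk(ud_to_logic(parse_to_ud(text + " " + question))))
```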

4.1 Semantic Parsing

The parser takes English strings of natural language text as input and outputs extended clausified first-order logic formulas encoded in JSON as proposed in JSON-LD-LOGIC [29]. The main extension is adding numerical confidence to clauses and implementing default logic [23] by including special literals to encode exceptions, as presented in our papers [28] and [27].

Parsing consists of a number of phases, each adding new structural details to the results of the previous phases. For the most part, the phases are implemented procedurally, without using explicit transformation rules: we found that the more complex aspects of translation cannot be easily expressed with the help of simple transformation rules. In particular, the correct interpretation of a sentence depends heavily on previous sentences and a collected database of objects which have been talked about.

Conversion to Universal Dependencies (UD) Format. We use the external Stanza parser to get UD format dependency graphs from input sentences. Stanza itself uses pretrained neural models. We first preprocess the English strings to avoid several typical mistakes of the Stanza conversion, and then use Stanza to get the UD graph. The graph is then fed to our small set of simplifying transformations returning a simplified text, which is again fed to Stanza to get the final UD graph. The simplification phase reduces the complexity and edge-case handling necessary in the UD-to-logic converter, and is a prime candidate for experimenting with LLM-based simplification.
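A minimal sketch of the two-pass scheme follows, using the real Stanza API but a purely illustrative simplification rule (dropping discourse markers); the actual transformation set is considerably larger:

```python
# Two-pass parse: Stanza -> simplify -> Stanza again. The simplification
# rule below (dropping words with the "discourse" dependency relation) is
# purely illustrative; the real transformation set is larger.
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

def simplify(doc) -> str:
    sentences = []
    for sent in doc.sentences:
        kept = [w.text for w in sent.words if w.deprel != "discourse"]
        sentences.append(" ".join(kept))  # crude detokenization for a sketch
    return " ".join(sentences)

text = "Well, the table was really nice."
final_ud = nlp(simplify(nlp(text)))  # second pass gives the final UD graph
```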

Converting UD to Logic. One of the strengths of the UD representation given by Stanza is its high level of detail. The first subphase of conversion restructures the UD graph into a semi-logical representation explicating the outermost logical structure around the subject/verb, object/verb or subject/verb/object tuples. The following subphases attach different kinds of properties to words. For example, the outermost structure constructed for the sentence “Most bears are big, although young bears are not big.” is

[and, svo[bear,be,big], svo[bear,be,big]]

which is then extended to

[and, svo[bear,be,big], svo[[props,young,bear],be,big]].

The words in these structures are key-value objects containing both the initial UD information and additional details added during the phases.
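For illustration, such a structure with its key-value word objects could be rendered in Python roughly as follows (a hypothetical rendering; the actual objects carry the full UD detail plus further phase-added annotations):

```python
# Hypothetical Python rendering of the semi-logical structure for
# "Most bears are big, although young bears are not big."
bear_most = {"lemma": "bear", "upos": "NOUN", "quantifier": "most"}
bear = {"lemma": "bear", "upos": "NOUN"}
young = {"lemma": "young", "upos": "ADJ"}
big = {"lemma": "big", "upos": "ADJ"}

structure = ["and",
             ["svo", bear_most, "be", big],
             ["svo", ["props", young, bear], "be", big]]
```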

The next subphase results in the extended logic in a non-clausified form, i.e. using explicit quantifiers. The conversion uses the previous structure recursively, taking into account the details of the original UD structure to find additional critical information like articles, negation, different kinds of quantifiers, etc. We follow the approach of Davidsonian semantics, introducing event identification variables, while not taking the neo-Davidsonian path of splitting all relations into their minimal components (see [33]).

For coreference resolution we calculate weighted heuristic scores for all candidate words, also using WordNet taxonomies. Another inherently complex task is determining whether a noun stands for a concrete object or should be quantified over. Importantly, any object detected is stored in a special data structure, with new information about the object possibly added as the parsing process proceeds.
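The following is a minimal sketch of such weighted scoring, assuming NLTK's WordNet interface for the taxonomy lookup; the weights and features are illustrative, not the ones used in the parser:

```python
# Sketch of weighted heuristic scoring of coreference candidates; assumes
# NLTK with the WordNet corpus installed (nltk.download("wordnet")).
# Weights and features are illustrative, not the parser's actual ones.
from nltk.corpus import wordnet as wn

def candidate_score(cand_lemma: str, cand_pos: int, pronoun_pos: int,
                    expected_class: str) -> float:
    # Recency: prefer candidates mentioned closer to the pronoun
    # (assumes the candidate precedes the pronoun in the text).
    score = 1.0 / (1 + pronoun_pos - cand_pos)
    # Taxonomy: bonus if the expected class is among the WordNet hypernyms.
    hypernyms = {h for s in wn.synsets(cand_lemma, pos=wn.NOUN)
                 for h in s.closure(lambda x: x.hypernyms())}
    if any(expected_class == h.name().split(".")[0] for h in hypernyms):
        score += 2.0
    return score

# Does the candidate "bear" fit a pronoun expected to denote an animal?
print(candidate_score("bear", cand_pos=4, pronoun_pos=10,
                      expected_class="animal"))
```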

Consider the example sentence “John is a nice animal who likes honey.” It is first converted to a conjunction of the following formulas:

$$ \begin{aligned}&\mathtt {isa(animal,c1\_John)} \\&\mathtt {prop(nice,c1\_John,generic,generic,ctxt(Pres,1))} \\&\mathtt {def0(c1\_John)} \\&\mathtt {\forall S \, (def0(c1\_John) \leftrightarrow } \\&\; \; \; \mathtt {\; \; \;\exists X \, isa(honey,X) \, \& \, (\exists A \, do2(like,c1\_John,X,A,ctxt(Pres,S))))} \end{aligned}$$

The system determined that in this sentence “John” refers to a concrete object and immediately created a Skolem constant \(\mathtt {c1\_John}\), storing it for possible later use and extension. Here it also created a new definition \(\texttt{def0}\) for encoding the complex property of “John”: liking honey. Properties of objects, like the one given in the second formula above, also encode the intensity of the property (slightly/very) and the comparative class: for example, saying “John is a very large animal ...” would create \(\mathtt {prop(large,c1\_John,3,animal,ctxt(Pres,1))}\). The constant \(\texttt{generic}\) indicates that the intensity is not known or that the property is not comparative, i.e. does not relate to a specific class. The term \(\mathtt {ctxt(Pres,1)}\) encodes contextual aspects: the present tense and a concrete situation number in a possible sequence of situations created by different actions. The variable \(\texttt{A}\) in the last formula is an identifier of an action, which can be given additional properties, like the place, time or assistive objects of an action, in the Davidsonian style.

In the representations above we have omitted the information about confidence and the possibility of exceptions. Indeed, the sentence we looked at is considered to be certain and without exceptions. However, the first part of the sentence “Most bears are big, although young bears are not big” attaches confidence 0.85 to the formula and includes a blocker literal encoding an exception in the sense of default logic, along with the comparative priority of the blocker:

$$\begin{aligned}&\mathtt {0.85: \, \forall X \, isa(bear,X) \rightarrow }\\&\; \; \; \mathtt {(prop(big,X,generic,generic,ctxt(Pres,1))) \vee }\\&\; \; \; \; \mathtt {block(h(bear,1), neg(prop(big,X,generic,generic,ctxt(Pres,1)))) } \end{aligned}$$

The blocker literals are used by the GK prover to recursively check the proof candidates found, with diminishing time limits: GK uses a part of a given time limit to attempt to prove each blocker literal in the proof. Whenever a blocker is proved, the candidate proof containing the blocker is considered invalid and thus discarded; see [27] for details.
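A minimal sketch of this recursive checking scheme follows; the prove interface is hypothetical, since in GK the searches are internal to the C reasoner:

```python
# Sketch of recursive blocker checking with diminishing time limits.
# prove(literal, seconds) is a hypothetical interface returning a proof
# object with a .blockers list, or None if no proof is found in time.
def proof_survives(blockers, prove, time_limit: float,
                   decay: float = 0.5, min_limit: float = 0.01) -> bool:
    """True iff no blocker literal of the candidate proof can be proved."""
    limit = time_limit * decay
    if limit < min_limit:
        return True                    # out of time: keep the candidate
    for blocker in blockers:
        sub = prove(blocker, limit)    # search for a proof of the blocker
        if sub is not None and proof_survives(sub.blockers, prove, limit,
                                              decay, min_limit):
            return False               # exception fires: discard candidate
    return True
```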

The system is also able to handle simpler questions involving sizes of sets, like “An animal had two strong legs. The animal had a strong leg?”, “John has three big nice cars. John has two big cars?”, and measures, like “The length of the red car is 4 m. The length of the black car is 5 m. The length of the red car is less than 5 m?”. We use terms encoding the sets and measures: for example, the first sentence of the last question is translated to a formula containing a standard equality predicate, an integer and several properties involving the measure term, including the main statement \(\mathtt {4=count(measure1(length,c1\_car,meter,ctxt(Pres,1)))}\).

Instance Generation. In order to answer questions which do not mention concrete objects, like “Adult bears are large animals. Cats are small animals. Who is a large animal?”, we need constants representing an anonymous instance of a class, essentially a “default adult bear”, a “default bear” and a “default cat”. For each such object the system generates a constant along with formulas indicating its class and properties, enabling the system to produce the answer “An adult bear”.

Question Handling. Actual questions like “Who is big?” or “The length of the red car is less than 5 m?” require special handling. The automated reasoner GK used in the pipeline employs the well-known answer predicate technique to construct and output the required substitution term. All the variables in the question formula will be instantiated and output, potentially resulting in a large combination of different answers. The “Who is big?” question will be first translated to \(\mathtt {\exists X,Y,Z \, prop(big,X,generic,Y,Z)}\) indicating that we are not restricting the “bigness” or context in the question. However, we do not want to enumerate different “bigness” values or contexts in the answer, thus we wrap the formula into a definition (say, \(\texttt{def2}\) ) over a single variable \(\texttt{X}\), and search for different substitutions into \(\mathtt {def2(X)}\) only. Asking questions about location and time is implemented by constructing a number of questions over relations “near”, “on”, “at”, etc.
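For illustration, the wrapping definition for the question above could take roughly the following shape in the notation used earlier (a sketch; the exact definition produced by the parser may differ in detail):

$$\mathtt{\forall X \, (def2(X) \leftrightarrow \exists Y,Z \; prop(big,X,generic,Y,Z))}$$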

Clausification and Simplification. The system contains a clausifier skolemizing the formulas and converting these to a conjunctive normal form. The clausification phase also performs several simplifications, some of which are possible due to the known properties of the constructed formulas. Since nontrivial formulas may be converted into several clauses, the clausifier decides how to spread the numeric confidence of the formula and the exception literals in the formula into the clauses.
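As an illustration, one simple policy for spreading confidence is sketched below; this particular nth-root policy is an assumption for illustration, not necessarily the policy implemented in the clausifier:

```python
# Sketch of spreading a formula's confidence c over its n clauses. Giving
# each clause confidence c**(1/n) keeps the product over the clause set
# near c. This nth-root policy is an illustrative assumption, not
# necessarily the clausifier's actual policy.
def spread_confidence(confidence: float, clauses: list) -> list:
    n = max(1, len(clauses))
    per_clause = confidence ** (1.0 / n)
    return [(per_clause, clause) for clause in clauses]

print(spread_confidence(0.85, ["clause1", "clause2"]))  # ~0.922 each
```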

4.2 Integration with Knowledge Bases

The knowledge base provides the world model of our reasoning system. To answer the query “Tweety is a bird. Can Tweety fly?”, the system needs the background knowledge that birds can fly. We construct the knowledge base (KB) using default logic rules augmented with numeric confidences. A small part of the knowledge base forms a core world model and is built by hand, while the bulk of the knowledge is integrated automatically from existing common sense knowledge (CSK) sources as described in [10].

We have integrated eight published knowledge graphs: ConceptNet [25], WebChild [30], Aristo TupleKB [15], Quasimodo [24], Ascent++ [16], UnCommonSense [3], ATOMIC\(^{20}_{20}\) [9] and ATOMIC\(^{10x}\) [32]. These CSK sources are collections of relation triples. The majority of the sources contain natural language clauses or fragments in the triple elements. We have built a specialized pattern matching semantic parser to convert the relations to first order logic rules with the default logic extensions and an estimated numeric confidence. The full knowledge base contains 18.5 million rules, over 15 million of which are related to taxonomy: inferring a property or an event from the class of an entity.
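A minimal sketch of the triple conversion idea follows, with hypothetical relation patterns and a simplified textual rule notation; the actual converter emits JSON-LD-LOGIC and handles natural language fragments inside the triples:

```python
# Sketch of converting CSK triples into default-logic rules with estimated
# confidences. Relation patterns and the textual rule notation are
# simplified illustrations, not the converter's actual output format.
def triple_to_rule(subj: str, rel: str, obj: str, conf: float):
    if rel == "CapableOf":
        head = f"can({obj},X)"
    elif rel == "HasProperty":
        head = f"prop({obj},X)"
    else:
        return None                 # many more relation patterns in reality
    return (f"{conf}: forall X isa({subj},X) => "
            f"{head} | block(h({subj},1), neg({head}))")

print(triple_to_rule("bird", "CapableOf", "fly", 0.9))
# 0.9: forall X isa(bird,X) => can(fly,X) | block(h(bird,1), neg(can(fly,X)))
```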

4.3 Automated Reasoning

We use our automated reasoner GK to solve the problems generated by the semantic parser. The reasoner uses both the parser output and a selected subset of the world knowledge to solve the questions. WordNet taxonomies are used to solve the precedence problem of exceptions. Large datasets are parsed, indexed and kept in shared memory for quick re-use. GK is built on top of GKC [26], our conventional high-performance resolution-based reasoner for first order logic; thus GK inherits most of the capabilities and algorithms of GKC. The main additional features of GK are the following:

  • Using a well-known answer clause mechanism for finding a number of different answers, with a configurable limit.

  • Finding expected proofs even if a knowledge base is inconsistent. Basically, GK only accepts proofs which contain a clause originating from the question.

  • Searching for both a proof of the question and a negation of the question/negation of each concrete answer.

  • Estimating the numeric confidence in the statements derived from knowledge bases containing uncertain contrary and supporting evidence obtained from different sources.

  • Handling exceptions by implementing default logic via recursively deepening iterations of searches with diminishing time limits.

  • Performing reasoning by analogy, employing known similarity scores of words along with exceptions.

The first four features are covered in our previous paper [28] and the following two are covered in [27]. The word similarity handling is currently in an experimental phase: the initial experiments show that a naive implementation creates an unmanageable search space explosion, and thus a layered approach is necessary.
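As an illustration of the confidence estimation feature, the following sketch combines uncertain supporting and contrary evidence in a noisy-or style. This particular scheme is purely an assumption for illustration; GK's actual confidence algebra is described in [28]:

```python
# Illustrative combination of uncertain supporting and contrary evidence.
# The noisy-or style aggregation below is an assumption for illustration;
# GK's actual confidence algebra is described in [28].
def noisy_or(confidences) -> float:
    p = 1.0
    for c in confidences:
        p *= 1.0 - c
    return 1.0 - p

def combined_confidence(support, contrary) -> float:
    return noisy_or(support) * (1.0 - noisy_or(contrary))

print(combined_confidence([0.85, 0.6], [0.2]))   # ~0.75
```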

As a simple example of the basic features, consider the sentences “John is nice. John is not nice. Mike is nice. Steve is not nice.” GK's output for the parsed versions of the following questions directly yields these answers: “John is nice?”: “Unknown”, “Mike is nice?”: “True”, “Mike is not nice?”: “False”, “Who is nice?”: “Mike”, “Who is not nice?”: “Steve”. For a slightly more complex example, consider the earlier “If an animal likes honey, then it is probably a bear. Most bears are big, although young bears are not big. John is an animal who likes honey. Mike is a young bear. Who is big?”. GK will output the following proof in JSON, where we have removed quotation marks and a number of steps:

(Figure a: the JSON proof output by GK; not reproduced here.)

Observe that we get two answers. The following NLP pipeline step removes the generic [[$ans,some_bear]], since the more informative [[$ans,c1_John]] is available. Here both proofs contain only positive parts, although in the general case we may find both a positive and a negative proof, each with its own confidence. GK discards both clauses produced during the search and final answers whose summary confidence falls below a configurable threshold, as well as proofs which do not contain a goal clause. The confidences stemming from input sentences like “Most bears are big \(\ldots \)” are taken from our ad-hoc mapping of words like “most” to numeric values. By default, “normal” rule sentences are given a confidence below one and include a blocker literal to allow exceptions.
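A minimal sketch of this filtering step, assuming the hypothetical convention that generated default instances carry a some_ prefix:

```python
# Sketch of dropping generic default-instance answers when a more
# informative concrete answer exists. The "some_" prefix convention is a
# hypothetical illustration of how generated instances could be marked.
def filter_answers(answers):
    concrete = [a for a in answers if not a.startswith("some_")]
    return concrete if concrete else answers

print(filter_answers(["c1_John", "some_bear"]))   # -> ['c1_John']
```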

The answers contain blocker literals, which have been recursively checked by separate proof searches before the final proof is accepted by GK. The details of these failed searches are not shown in the final proof. Had we included the sentence “John is not big” in our example, then the proof of the first blocker of the main answer would have been found, thus disqualifying the proof and leaving us with the final answer “Likely a bear.”.

4.4 Answers and Explanations in Natural Language

Answers and explanations are generated from the proof, with additional details taken from the database of objects along with their properties as detected during semantic parsing. While some of the principles were described in the previous section, there are two major tasks to perform: give a suitably detailed representation of objects in a proof (say, select between “a car”, “a red car”, “the red car”, “Mike’s car” etc.) and create a grammatically correct and easy-to-understand textual representation of clauses. The system translates clauses in a proof one-to-one to English sentences, as exemplified by the explanation generated from the previously presented proof:

(Figure b: the natural language explanation generated from the proof; not reproduced here.)

5 Performance and the Test Set

The system has miserable performance on most well-known natural language inference or question answering benchmarks, the majority of which are oriented towards machine learning. As an exception, the performance on the anti-machine-learning question set HANS [13] is ca 95%, in contrast to the ca 60% performance of LLM systems before the GPT-3 family (random choice would give 50%). The 5% loss on HANS is due to wrong UD parses chosen by Stanza.

However, the system is able to solve almost all of the demonstration examples of the Allen AI ProofWriter system https://proofwriter.apps.allenai.org/ and is able to solve inference problems the current LLM systems cannot, like the examples presented in the introduction. For regression testing we have built a set of ca 1600 simple questions with answers, structured over different types of capabilities. This test set may be of use for people working towards similar goals.

The runtime for the small examples presented in the paper is ca 0.5 s on a Linux laptop with a graphics card usable by Stanza. Of this time, Stanza UD parsing takes ca 0.17 s, UD-to-logic conversion takes ca 0.04 s, and the rest is spent by the reasoner. For more complex examples the reasoner may spend an unlimited amount of time, i.e. the question is rather how complex questions can be solved in a preconfigured time window. In case the size of the input problem is relatively small and a tiny world model suffices for the solution, the correct answer is found in ca 1–2 s. However, in case the system is given a large knowledge base (KB) with a size of roughly one gigabyte, and the answer actually depends on the KB, the search space may explode and the system may fail to find an answer in a reasonable time. Efficiently handling a very large knowledge base clearly requires suitable heuristics based on the semantics and interdependence of rules/facts in the KB.

6 Towards a Hybrid Neurosymbolic System

Although the scope of the sentences successfully parsed and questions answered could be improved by adding more and more specialized cases to the current system, the cost/benefit ratio of this work would rapidly decrease. We will describe the most promising avenues for extending the system with ML hybridization as we currently see them.

Semantic Parsing. The two main approaches would be (a) end-to-end learning from sentences directly to extended logic as exemplified in [31], and (b) using existing LLMs or training specialized LLMs to perform simplification of sentences to the level where a hand-made semantic parser is able to convert the sentence to logic. Our initial experiments with the GPT models have shown that using a suitable prompt causes the LLMs to successfully split and simplify complex sentences.
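A minimal sketch of such an experiment follows; the prompt wording is illustrative, not the prompt used in our experiments, and the snippet assumes the chat completion interface of the openai Python package available at the time of writing:

```python
# Sketch of LLM-based sentence simplification. The prompt is illustrative,
# not the one used in our experiments; assumes the openai package's chat
# completion interface available at the time of writing (May 2023).
import openai

PROMPT = ("Rewrite the following text as short, simple sentences, each "
          "stating a single fact, without changing the meaning:\n\n")

def llm_simplify(text: str) -> str:
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT + text}])
    return reply.choices[0].message.content
```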

Automated Reasoning. Despite being optimized for large knowledge bases and performing well in reasoning competitions on such problems, our system often fails to find nontrivial proofs in reasonable time when a large knowledge base is used. The main approaches here would be (a) learning to find a proof based on the experience of previous proofs (see [19] for an example), (b) using machine learning along with measures of the semantic relatedness of formulas to the assumption and the question (see [6] for an example), and (c) using LLMs to predict intermediate results or relevant facts and rules. A significant boost in terms of usability could be achieved by integrating external systems like databases and scientific computing with the automated reasoners.

The Knowledge Base. Publicly available knowledge bases do not focus on formalizing a basic world model, arguably critical for common-sense reasoning. It is possible that a core part needs to be built by hand. On the other hand, the existing knowledge bases along with large text corpora can be extended by creating crucial new uncertain rules using both simpler statistical methods and more complex ML techniques: see [8] for a review.

7 Summary and Future Work

We have described an implementation of a full natural language inference and question answering pipeline built around an extended first order reasoner. The system is capable of understanding relatively simple sentences and giving reasonable answers to questions, including types currently beyond the capabilities of LLMs. We plan to enhance the capabilities of the system by incorporating machine learning techniques into the components of the pipeline, while keeping the overall architecture, including the semantic parser, world knowledge and a reasoner. At the time of this writing we are experimenting with using off-the-shelf LLMs without finetuning, but with a suitable prompt, to split and simplify complex sentences to a degree where our semantic parser is able to properly convert the meaning of the resulting sentences to logic.