Reasoning About Truth in First-Order Logic

First, we describe a psychological experiment in which the participants were asked to determine whether sentences of first-order logic were true or false in finite graphs. Second, we define two proof systems for reasoning about truth and falsity in first-order logic. These proof systems feature explicit models of cognitive resources such as declarative memory, procedural memory, working memory, and sensory memory. Third, we describe a computer program that is used to find the smallest proofs in the aforementioned proof systems when capacity limits are put on the cognitive resources. Finally, we investigate the correlation between a number of mathematical complexity measures defined on graphs and sentences and some psychological complexity measures that were recorded in the experiment.


Introduction
What truths are humans able to identify in the case of first-order logic (FO) on finite models?This is a fundamental question of logic, linguistics, and psychology, which nevertheless remains largely unexplored.In this paper, we study this question systematically using psychological experiments and computational models.We combine elements of proof theory and cognitive psychology and extend earlier work on propositional logic (Strannegård et al. 2010).

Psychological Complexity
Human reasoning concerning FO truth lends itself perfectly to exploration using standard methods of experimental psychology.This holds also for other model-theoretic logics (Ebbinghaus 1985).Fix a vocabulary τ and let MOD be the set of finite τ -structures for which the domains are subsets of some fixed denumerable set.Also, let S E N be the set of FO τ -sentences and define A psychological experiment pertaining to Truth and a given participant can be set up by specifying the experiment's procedure and a finite set of test items TEST ⊂ MOD × SEN.An example of such a test item is given in Fig. 1.An experiment of this type yields response times and correctness data for the items of TEST.It also yields a set POS ⊆ TEST consisting of those elements of TEST that the participant classified as members of Truth.The sets POS and Truth generate a division of TEST into four subsets: true positives, false positives, true negatives, and false negatives.These experimental data can then be approximated using computational models, for which the predictive power can be evaluated using statistical measures such as the (Pearson) correlation.

Mathematical Complexity
What factors influence the difficulty for a human to determine whether M | A for a finite model M and FO sentence A? Are they the properties of the sentence, the model, or some combination of such properties?
The properties of the sentences are clearly important.For instance, the truth-value of ∃x E(x, x) is generally easier to determine than the truth-value of ∀x∃y E(x, y).Which properties of sentences are important here?Is it length, quantifier depth, parse tree depth, number of negations, type of connectives, a combination of these properties, or something else?
It is also clear that the properties of the models matter.For instance, the difficulty of determining the truth-value of ∀x∃y E(x, y) clearly depends on the properties of the model.Which properties of models are important in this connection?Is it cardinality, number of edges, some version of Kolmogorov complexity, a combination of these, or something else?One strategy for predicting the difficulty of determining truth is to combine n complexity measures on sentences with m complexity measures on models using some function f (e.g., a polynomial) of n + m variables.Despite the apparent generality of this strategy, it might be inadequate regardless of the choice of f .In fact, if the interplay between sentences and models is more intricate than that, then this strategy is bound to fail.
An entirely different strategy is to assume that truth-values are computed and that the properties of such computations are the most promising indicators of difficulty.This is the strategy we use in this paper.

Models of Human Reasoning
Cognitive architectures, such as SOAR (Laird et al. 1987), ACT-R (Anderson and Lebiere 1998), Clarion (Sun 2007), and Polyscheme (Cassimatis 2002), are computational models that have been used to model human reasoning in a wide range of domains.They typically include explicit models of cognitive resources, such as working memory, sensory memory, declarative memory, and procedural memory; these cognitive resources are known to be bounded in various ways, e.g., with respect to capacity, duration, and access time (Kosslyn and Smith 2006).

Models of Logical Reasoning
In cognitive psychology, several models of logical reasoning have been presented in the mental logic tradition (Braine and O'Brien 1998) and the mental models tradition (Johnson-Laird 1983).Computational models in the mental logic tradition are commonly based on natural deduction systems (Prawitz 1965); the PSYCOP system (Rips 1996) is one example.The analytical focus of the mental models tradition has been on exploring particular examples rather than developing general computational models.
Human reasoning in the domain of logic is considered from several perspectives in Adler and Rips (2008) and Holyoak and Morrison (2005).Stenning and van Lambalgen (2008) consider logical reasoning both in the laboratory and "in the wild" and investigate how problems formulated in natural language and situated in the real world are interpreted (reasoning to an interpretation) and then solved (reasoning from an interpretation).
Formalisms of logical reasoning include natural deduction (Prawitz 1965;Jaskowski 1934;Gentzen 1969), sequent calculus (Negri and von Plato 2001), natural deduction in sequent calculus style (Negri and von Plato 2001), natural deduction in linear style (Fitch 1952;Geuvers and Nederpelt 2004), and analytic tableaux (Smullyan 1995).Several formalisms have also emerged in the context of automated reasoning, e.g., Robinson's resolution system (2001) andStålmarck's system (2000).None of these formalisms represent working memory explicitly, and most of them were constructed for purposes other than modeling human reasoning.
We suspect that working memory could also be a critical cognitive resource in the special case of logical reasoning (Gilhooly et al. 1993;Hitch and Baddeley 1976;Toms et al. 1993).Therefore, we will define our own proof system that includes an explicit model of working memory.

Structure of the Paper
In Sect.2, we report the results of a psychological experiment concerning FO truth.In Sects. 3 and 4, we present proof systems for showing FO truth and FO falsity, respectively.In Sect.5, we present resource-bounded versions of these proof systems.In Sect.6, we present a number of mathematical complexity measures, some of which are defined in terms of the resource-bounded proof systems.In Sect.7, we compare the psychological complexity measures that were obtained in the experiment with the mathematical complexity measures.Section 8 contains a discussion and Sect.9, finally, presents some conclusions.

Experiment
In this section, we describe a psychological experiment concerning Truth.

Participants
The participants in our experiment were ten computer science students from Gothenburg, Sweden, whom were invited through email.These students had previously studied FO in their university education.They belonged to various nationalities, were aged 20-30 years, and included one woman and nine men.

Material
Fifty test items were prepared for the experiment.Each item consisted of a finite graph and an FO sentence.The task was to determine whether the sentence was true in the graph.A screenshot of the graphical user interface is shown in Fig. 1.The test comprised 24 true items and 26 false ones, which were presented in an order that was randomized for each participant.
The logical symbols used in the experiment were the quantifiers ∀ and ∃ and the connectives ¬, ∧, ∨, →, and ↔.The vocabulary τ included the binary predicate E(x, y) (for edges); the unary predicates Red(x), Blue(x), and Yellow(x) (for colors); and the constants 1, 2, 3, and 4 (for nodes).Let L(τ ) be the set of FO-formulas defined in the usual way over the vocabulary τ .
A total of 50 test items, consisting of pairs of graphs and L(τ )-sentences, were prepared.They were selected manually on the basis of estimated level of difficulty from randomly generated lists of graphs and sentences.All nodes were labeled with unique numbers and colored either yellow, red, or blue.Some of the test items are given in Appendix A and the full list can be found in Nizamani (2010).

Procedure
The experiment was conducted in a computer laboratory at the Department of Applied Information Technology, University of Gothenburg.Each participant was assigned an individual computer terminal.A short practice session was conducted before the test to familiarize the participants with some sample problems.The experiment was conducted in a 25-min session, followed by a 10-min break, followed by another 25-min session.
The answers and the response times were recorded for each participant and each item.Two aggregated complexity measures were computed for each item.The accuracy is the proportion of participants who answered the item correctly.The latency is the median response time among the participants who answered the item correctly.The median was used, rather than the mean, to reduce the effects of extreme response times due to external disturbances.

The Proof System T M
In this section, we describe a proof system T M for proving sentences to be true in a fixed finite model M. Please observe that this system is not a classical proof system in which only logical truths can be derived, but a system in which all truths in some fixed model M are derivable.
The system T M is a rather straightforward rewrite system with some modifications and extensions.The proofs are linear sequences of sentences.The system is local in the sense that checking whether a sequence of sentences is a proof can be done by only looking at two consecutive sentences at a time.The proofs are also goal-driven, which means that they start with a proof goal, i.e., the desired conclusion, and then use rules to reduce the goal to the true statement .
The proof system has two ingredients, axioms and rules.The axioms model declarative memory content and visual information about models and sentences, whereas the rules model the procedural memory.The main rule (Substitution) is based on the principle of compositionality, which is that meaning (and truth-value, in particular) is preserved under the substitution of logically equivalent components.The axioms are used as side-conditions to this rule.For example, substitution of A for B is only allowed if A ↔ B is an axiom.Substitution is a deep rule, which means that it allows subformulas appearing deep down in the parse-tree to be substituted.

Formulas
We shall now enrich our set of formulas L(τ ) by adding (i) bounded quantifiers ∀ Ω and ∃ Ω , where Ω is a set of constant symbols in τ and ∀ Ω x A has the intended meaning ∀x( c∈Ω x = c → A); (ii) abstraction boxes for modeling visual perception of formulas, A ; and (iii) contexts for assigning values to variables, Definition 1 (Formula) The set of L * (τ )-formulas is defined by the following BNF grammar: Definition 2 (Satisfaction relation) The satisfaction relation for formulas in L * (τ ) is defined in the standard Tarskian way with the following extra clauses, where s is an assignment: Next, we define what it means for a relation symbol R in a formula to be substituted by a formula.Definition 3 (Substitution) If R is a relation symbol of arity k, A a formula in L * (τ ∪ {R}) and B a formula in L * (τ ) with the free variables x = x 1 , . . ., x k , we define the substitution A[B( x)/R] to be the formula we obtain from A by substituting the leftmost occurrence of the symbol R with B[t 1 , . . ., t k /x 1 , . . ., x k ], where R occurs as R(t 1 , . . ., t k ) and t i is free for Observe that substitutions are not carried out inside abstraction boxes C .

Axioms
The axiom set Γ is constructed as the union of three sets: the logical axioms, the sentence axioms and the model axioms.

Logical Axioms
The set of logical axioms is denoted Γ T .The members of this set are the logical truths listed in Appendix B. This list was compiled from a number of textbooks on basic logic, including (Huth and Ryan 2004).

Sentence Axioms
The second ingredient of Γ comes from the idea to include parsing of a formula in the proof system itself.When we must check the truth of a formula, we first must parse the formula.This action is modeled in the system using abstraction boxes, which are considered black boxes that the agent cannot look inside.By using the rules of the proof system, the agent may unwind the formula one step at a time.Formally, we add one abstraction box for each formula in the language.The abstraction box corresponding to the formula A is denoted by A .These boxes are in many respects similar to atomic formulas; the free variables of the abstraction box A are defined to be the same as the free variables of A. This idea is closely related to the template logic introduced in Engström (2002), which was constructed to understand non-standard formulas in non-standard models of arithmetic.
In a bounded version of the proof system that will be introduced later, we will restrict the complexity of the formulas that may be used in a proof.Thus, the abstraction boxes may be used to reduce this complexity.The drawback of this reduction in complexity is that the substitution rule (see Sect. 3.3.1)may not substitute inside abstraction boxes.
To include parsing in the system, we add, for each pair of formulas A and B of L(τ ) and each variable x, the following axioms to the set Γ S of sentence axioms: Thus, by using these axioms and the substitution rule, we can replace a template symbol in a formula by a more complex formula.By repeating this procedure, we may eliminate all abstraction boxes from a formula, thus replacing A by A. However, this result is achieved only with a considerable increase in complexity.

Model Axioms
Let M be a finite structure in which all elements are named by some finite set of constants Ω 0 .To analyze the quantifiers in formulas, we require axioms that specify the range of quantifiers and the truth of quantifier-free sentences.If A is an L * (τ )formula (including abstraction boxes), we denote the corresponding formula without abstraction boxes by A .The set Γ M consists of the following formulas: 1.A and A ↔ , whenever A is a quantifier-free formula that is true in M. 2. ¬A and A ↔ ⊥, whenever A is a quantifier-free formula that is false in M.
In the above formulas, Ω is any set of constants.The set of axioms Γ is defined as the union of Γ T , Γ S , and Γ M (which depends on the model M).

Rules
In this subsection, we present the rules of T M .

Substitution
This is a deep rule in the sense that formulas can be substituted deep in the parse tree of a sentence.Only one occurrence of a formula can be substituted at a time: This rule may only be applied if C ↔ B ∈ Γ and FV(C) = FV(B).We label the rule by LE (Logical Equivalence), FI (Formula Inspect), or MI (Model Inspect) when the equivalence B ↔ C comes from Γ T , Γ S , or Γ M , respectively.
In the case of MI, we allow all contexts in A to be used for determining whether C ↔ B is in Γ M .I.e., if γ is the list of all contexts in A then B may be replaced by C if (Cγ ) ↔ (Bγ ) ∈ Γ M . 1 For example in a model where P(c) holds we may use MI to deduce In short, Substitution enables us to deduce A[C( x)/R] from A[B( x)/R] whenever C ↔ B ∈ Γ and C and B have the same free variables.

Strengthening
We may replace the goal to show B to be true with the goal to show A to be true whenever we know A → B to be a logical truth: B S A To apply this rule, we require that A → B ∈ Γ T .

Truth Recall
If we have derived a formula that we know to be a logical truth, we have succeeded: A TR To apply this rule, we require that A ∈ Γ T .
This concludes the definition of the proof system T M .By letting the first sentence in the sequence be A rather than A, we aim to model the idea that an individual who is presented with a sentence perceives the structure of the sentence gradually, starting from perceiving nothing and possibly ending before all the details have been perceived.Formula inspection can be used for gradually perceiving and parsing the sentence.

Proofs
As an example of a proof in T M , let us prove that the sentence given in Fig. 1 is true in the model given in the same figure: Note that the abstraction boxes may be removed in the beginning of the proofs by zooming in on the formula using iterated applications of Substitution (FI).Abstraction boxes are strictly speaking unnecessary at this point, but they will play an important role in Sect.5, in which bounded versions of proof systems will be considered.

Properties
The system T M includes simple models of the following cognitive resources: -declarative memory, which is modeled by the logical axioms in the set Γ T ; -procedural memory, which is modeled by the rules; -sensory memory, which is modeled as a buffer to hold sentence axioms from the set Γ S (modeling visual perception of sentence structure); -working memory, which is modeled as a buffer to hold temporary proof goals.Now, let us show the adequacy of T M for Truth.

Proposition 1 Suppose M ∈ MOD and A ∈ SEN. Then, M |
Proof Right-to-left is soundness: We prove that in any proof of T M , if A occurs, then M | A. This is done by induction starting from the bottom; the trivial base case is when A is .For the induction step, all we must check is that truth is preserved (in the sense that if A may be deduced from B and M | A, then M | B) by the rules in T M .This is straightforward.
Left-to-right is completeness: Suppose M | A. Note that by essentially using Substitution (FI) followed by Substitution (MI), we may reduce any sentence A to a propositional formula in which only and ⊥ are atomic formulas.Regardless of the details of this propositional formula, we can then use Substitution (LE) to produce strictly shorter formulas at each step until either or ⊥ is reached.Because M | A, however, the soundness of the system forces the proof to end with .

The Proof System F M
In this section, we describe a proof system F M for showing sentences to be false in a fixed finite model M. The formulas and axioms of F M are the same as for T M .

Rules
Now, let us define the rules of F M .

Substitution
This rule is identical to the Substitution rule of T M .

Weakening
We may replace the goal of showing B to be false with the goal of showing A to be false whenever we know B → A to be a logical truth: B W A To apply this rule, we require that B → A ∈ Γ T .

Falsity Recall
This rule makes use of a set Γ F , which contains contradictions only.We omit the exact definition of Γ F , which is a list of textbook contradictions in the style of Appendix B.
If we have derived a formula which we know to be a contradiction, we have succeeded: A FR ⊥ To apply this rule, we require that A ∈ Γ F .

Proposition 2 Suppose M ∈ MOD and A ∈ SEN. Then, M |
Proof This proof is analogous to the proof of Proposition 1.

The Bounded Proof Systems BT M and BF M
In this section, we define resource-bounded versions of the proof systems T M and F M .For this purpose, we require a precise definition of sentence length.
Definition 6 (Formula length) The length |A| of an L * (τ )-formula A is defined as follows: Now we can define our two bounded proof systems.
Definition 7 (BT M and BF M ) Let BT M be the proof system obtained from T M by adding the following restrictions on the proofs: -Working memory limit.The maximum length of a sentence that can appear in a proof is 8. -Sensory memory limit.The rule Substitution (FI) is only allowed on sentence axioms of Γ S of maximum length 7.
The proof system BF M is defined analogously from F M .

123
In the bounded proof systems, the abstraction boxes play an important role because they enable certain sentences to be provable that would not be provable without them.For instance, note that the FO sentence ∨ A, where |A| > 6 is valid and that this validity can be deduced without looking inside the right disjunct.This sentence is not allowed to appear in a proof in BT M , however, because its length exceeds 8.However, we have the following proof in BT M :

Mathematical Complexity Measures
In this section we define a number of mathematical complexity measures on items (M, A).The complexity measures are as follows (the ranges refer to the items appearing in the experiment).
3. Negation count.The number of negations appearing in A. Range: 0-2. 4. Cardinality.The number of nodes of M. Range: 3-4. 5. Edge count.The number of edges of M. Range: 3-6. 6. Linear combination.A linear combination of sentence length, cardinality, and sentence length * cardinality, whose coefficients can be fitted to experimental data.7. Working memory.The minimum size of the WM required for proving A in BT M or BF M .Range: 3-8.8. Proof length.The minimum number of steps of a proof of A in BT M or BF M .
Range: 4-38.9. Proof size.The minimum size of a proof of A in BT M or BF M , i.e. the minimum sum of the lengths of the formulas appearing in the proof.Range: 6-164.
Let us provide some motivation why these particular complexity measures were chosen.
Measures 1-5 are standard complexity measures in logic.Measure 6 (Linear combination) illustrates the possibility of combining such complexity measures in various ways, e.g.via polynomials.As suggested in Sect.1.2, there may be a priori arguments for believing that these complexity measures are inadequate for the present experiment.Measure 7 (Working memory) models the maximum strain on the working memory and measure 8 (Proof length) models the length of the shortest train of thought leading to the desired conclusion.
Measure 9 (Proof size) is based on the idea that latency in the context of logical reasoning depends on some notion of computational workload.By identifying the working memory load with formula length, we get proof size as a measure of computational workload.Thus measure 9 models the minimum amount of data that must flow through the working memory for the problem to be solved.We implemented automated theorem provers for the systems BT M and BF M in the functional programming language Haskell.Proofs of minimum length and minimum size were generated for all items appearing in the experiment.For instance, the proof appearing in Sect.3.4 was generated by our theorem prover, as a proof of minimum size.

Results
Table 1 shows the mean latency and mean accuracy recorded at the experiment.Table 2 shows correlations between latency and the mathematical complexity measures defined in Sect.6.Table 3 shows correlations between accuracy and the same complexity measures.

Discussion
In this paper, we studied human reasoning in the case of FO truth using standard methods of cognitive psychology.This reflects our view that there is nothing special about problem-solving in the domain of logic, and therefore, it can be explored using ordinary experimental methods and modeled using ordinary models of cognitive psychology that would apply equally well to mental arithmetic, Sudoku, or Minesweeper, for example.

Comments on the Experiment
In a preliminary investigation, we observed that the difficulty of determining truth was affected by the manner in which the models were presented graphically.For instance, determining whether a graph is complete seems to be simpler if the graph is drawn in a regular fashion.To mitigate this problem, we drew our graphs by placing the nodes equally distributed on a circle.
As is usual in experiments of this type, the answers given by the participants are potentially problematic because guesses, interaction errors (e.g., hitting the wrong button accidentally), and distractions in the experimental situation (e.g., coughing attacks) can affect the individual results substantially.Another potential problem relates to the instructions.In our particular experiment, we learned afterwards that some of the participants thought that if two variables x and y appeared in a sentence, then automatically x = y.Our instructions failed to address this point.
On the aggregate level, some of the problems on the individual level might partially cancel out, but then new problems arise on the modeling side.In fact, to model experimental data on the aggregated level, one must develop a computational model that represents some sort of average of the participants.Strictly speaking, this might not be possible using our present type of computational model, which was designed for cognitive modeling on the individual level.These factors should be borne in mind when evaluating the results.

Comments on the Proof Systems
The bounded proof systems described in Sect. 5 can be modified in several ways to suit different human role models.One way is to modify the axioms; a second is to modify the rules; a third is to change the working memory and sensory memory capacities; and a fourth is to add models of other cognitive resources.
Intuitively, proof-size reflects how comprehensive a thought must be, i.e., how much information must be processed to produce the correct answer.Proofs that require more information processing might take longer to process and be more prone to error.One may perhaps think of the smallest proofs as the smartest proofs, relatively to a given repertoire of cognitive resources.
As is often the case with cognitive modeling, this complexity measure can be criticized for being too coarse.For instance, it does not directly reflect the effort of searching for a proof; instead, it reflects the effort required to verify the steps in a proof that has been found.Although the efforts required for finding a proof and verifying the same proof may be somewhat correlated, this limitation certainly allows for future improvements to the model.

Comments on the Results
The data in Table 1 indicate that the True items were easier to solve than the False items.
Among the complexity measures analyzed in Table 2, Proof size has the highest correlation with latency, both for True and False items.Second comes the closely related complexity measure Proof length.This might indicate that complexity measures that are defined in terms of computations are more adequate than those defined in terms of standard properties of models and sentences.
Some of the complexity measures on sentences in Table 2 fared relatively well.In Sect. 1 it was argued that in general, properties of models can dramatically affect the difficulty of determining truth.The reason why those complexity measures on sentences were relatively successful in this particular case might be that the models that were used in the experiment were quite homogeneous.In fact, Cardinality varied between 3 and 4, and Edge count varied between 3 and 6.Therefore, the complexity measures that considered only sentences were perhaps not sufficiently challenged in the present experiment.
All of the complexity measures analyzed in Table 3 have relatively low correlations with accuracy.We do not know why the contrast to latency is so pronounced.The correlation values for Negation count stand out here and provide some support to the idea that negations increase the probability of error.

Conclusions
In this paper, we presented (i) the results of an experiment pertaining to Truth, (ii) two proof systems for deriving membership and non-membership in Truth using bounded cognitive resources, and (iii) an analysis of the correlation between psychological complexity measures and different mathematical complexity measures, including proof size.The results indicate that proof size was more successful than the other complexity measures in the case of latency.The approach that we use enables predictions about latency to be made for arbitrary elements of Truth.To evaluate the usefulness of this approach, which combines elements of proof theory and cognitive psychology, more experiments are needed, in particular experiments with more heterogeneous test items.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Appendix A: Experimental Data
To enable a deeper understanding of the experiment, let us list items 1-9 of the experiment and give their associated data.A full description of the items used can be found  Table 5 The models of items 1-9.On the top row, items 1-3 are shown, and so on in Nizamani (2010).Tables 4 and 5 list the sentences and models of items 1-9, respectively.Table 6 shows the data recorded at the experiment concerning items 1-9.Table 7, finally, shows the mathematical complexity measures of items 1-9.

Fig. 1
Fig. 1 Screenshot of the interface used for presenting the test items.The correct answer to the item shown is "Yes" and A i+1 follows from A i by one of the rules of T M .
and A i+1 follows from A i by one of the rules of F M .Now, let us prove the adequacy of F M for showing falsity in M.

Table 1
Mean latency and mean accuracy for true and false items

Table 6
Response times of the participants P1-P10 on items 1-9

Table 7
The mathematical complexity measures of items 1-9