1 Background

The development and release of the attention-based transformer neural network architecture in 2017 have since led to a stunning explosion of such models [1]. An early popular example is the Bidirectional Encoder Representations from Transformers (BERT) model [2], which soon led to many domain-specific variants, as well as a more optimized version that yielded significant improvements without major changes to the original BERT architecture [3]. Perhaps because of this success, researchers have been attempting to empirically understand the properties (including biases and blind spots [4]) of even early transformer models, such as BERT, along multiple dimensions [5,6,7]. While these tests, some of which have been adversarial by design, have revealed some problems, a growing body of research also shows that these models have achieved truly impressive, non-incremental performance advances on various natural language understanding problems [8].

It can be tempting to overweight mistakes made by these models, especially when the mistakes are ‘un-humanlike’ and occur in seemingly simple situations, and to dismiss the models as incapable of semantics or symbolic processing, but such commentary potentially opens the door to confirmation bias. We are not denying the utility of critical and adversarial testing of such models [9, 10]; however, we do caution that there is a danger of such results being interpreted out of context. Arguably, the latest transformer models, such as ChatGPT and DALL-E, captured the public spotlight by being able to process relatively complex human inputs with unprecedented skill [11]. They have also ignited an AI arms race of sorts between large technology corporations. Some of this discourse is hype, but some of it could be argued to correctly describe a major leap in AI progress, at least in an empirical sense [12, 13]. On the academic front, large language models have garnered interest well beyond industry, education, and application. At recent international conferences on cognitive science [14], for instance, they have been given their own sessions, and at least a few papers and abstracts from that community have attempted to study them from a cognitive perspective. For example, Collins et al. investigated out-of-distribution reasoning in large language models (LLMs) [15] and found that LLMs are far less robust than humans when dealing with out-of-distribution problems. In another recent work, Tang and Kejriwal evaluated language models on rational decision-making problems using ‘bet’ questions [16]. They found that, while language models can deal with bet questions if they are first fine-tuned on such questions, modifying a bet question without changing its fundamental characteristics can cause performance to decrease by more than 25%, showing a lack of generalization.

With this emerging literature in mind, we propose that such models should rightfully be evaluated as ‘cognitive machines’, i.e., rather than probing the models in an ad hoc fashion, we propose the use of systematic cognitive tests administered without a priori knowledge of how the model will perform, inspired by a long history of similarly evaluative work in both animal cognition and human psychology, often with a utility in mind, such as early detection of Alzheimer’s disease or assessment of spaceflight readiness [17,18,19]. We stop short, at the outset, of claiming that such models are cognitive machines. Rather, we merely claim that they should be evaluated as such, precisely because the extent of their ability to reason in a humanlike manner continues to be controversial and subject to investigator and confirmation bias.

We note that systematic benchmarking of such models on applied AI tasks (e.g., question answering, text summarization) is already the norm in computer science, but such benchmarking has itself come under criticism for general AI problems like commonsense reasoning [20]. Unfortunately, a concordant degree of discipline has not been applied in studying these models qua cognitive machines. Rare examples of such systematic testing include (Footnote 1) a rigorous examination of DALL-E 2 on compositionality and anaphora (among others) by Marcus et al. [24], and of common syntactic processes by Leivada et al. [25]. Another example, which compares the outputs of different models, is [21]. Others have attempted such studies for very specific problems like multiple-choice question answering [22, 23].

In this preliminary study, we conduct a case study evaluation by constructing and applying two cognitive tests (spatial reasoning and decision-making). Both of these cognitive abilities are crucial if AI models are to be successfully and reliably deployed in a range of general-purpose real-world applications. For example, spatial reasoning has been argued to be an important component of grounded language understanding [26], and rationality is important to verify in models that make decisions under uncertainty and with limited resources [27]. Spatial reasoning tests are more appropriate for a text-to-image generative model such as DALL-E 2, whereas the decision-making test is more appropriate for models that can better express logic and abstraction in common language. Both tests require text as input, but we hypothesize that spatial reasoning is more directly tested through the production of visual output, while decision-making is better tested through contextualized conversation-style text output and is hence more appropriate for a large language model like ChatGPT:

  • Spatial reasoning: This test comprises a set of prompt pairs, where each prompt expresses a spatial relationship between common objects, such as apples or oranges. The two prompts in a pair use the same objects and are largely identical at the surface level, but contain a single contrast, e.g., a top relation versus a bottom relation between the objects. We assess both the extent to which DALL-E 2 is able to understand each prompt in isolation and the difference in outputs between the two prompts, thereby aiming to test whether the model can convincingly differentiate between the relations.

  • Decision-making: This test uses a dialectic sequence of prompts to understand whether the model obeys some of the axioms of Von Neumann-Morgenstern (VNM) rationality, first elucidated under the Von Neumann-Morgenstern utility theorem [28]. These axioms are a set of assumptions about the preferences of a rational decision-maker when faced with uncertain outcomes. We briefly describe the intuition behind the specific axioms that we test when presenting the results of the study. An example of a rationality-implying VNM axiom is transitivity: if I prefer A over B, and B over C, then it is necessarily implied that I must (under conditions of rationality) also prefer A over C (stated schematically after this list). These axioms are considered particularly important in the formal decision sciences, although they are not always adequate to explain or justify human reasoning or motivation, as an ample body of literature in behavioral economics has shown.
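
Writing $\succeq$ for ‘is weakly preferred to’, the transitivity axiom can be stated schematically (in one standard formulation) as:

$$A \succeq B \;\wedge\; B \succeq C \;\Rightarrow\; A \succeq C$$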

In quantitatively analyzing the spatial reasoning results, we employ the VISOR (for Verifying Spatial Object Relationships) metric from [29]. VISOR measures the percentage of correctly generated images: it is computed by dividing the number of correctly generated images by the total number of generated images, and it can be computed both for individual prompts and across the full set of prompts; a simplified form is given below, and we refer the reader to [29] for the full formal definition. We determined the correctness of the generated images for a given prompt using our own judgment, but to guard against bias, we also provided the generated images and the prompt to a third, independent annotator. The annotations provided by this third annotator agreed perfectly with our own. In general, we designed the prompts for the spatial reasoning task to be fairly unambiguous.
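
In the simplified form used in this paper (a rendering of the description above; see [29] for the full formal treatment), the metric is simply the fraction of generated images judged correct:

$$\mathrm{VISOR} = \frac{\#\{\text{correctly generated images}\}}{\#\{\text{generated images}\}}$$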

Our main aim through this case study is to provide a methodology with which such AI models should be evaluated. We hope that the methodology itself can be used in the future to conduct a large-scale evaluation with multiple quantitative measures. In the last section of this paper, we comment on how the methodology could potentially be applied and scaled, and the evaluation issues that may arise in practice, in a larger-scale, more controlled version of this case study.

2 Pilot study

2.1 Spatial reasoning

Inspired by [25], where DALL-E 2 was subjected to an examination of eight grammatical phenomena pertaining to compositionality, with each phenomenon tested by pairs of prompts specifically targeted at that phenomenon, we construct and apply spatial reasoning prompts to the DALL-E 2 model. Unlike [29], which focused on two-dimensional spatial relationships (i.e., left/right/above/below), we extend such spatial relationships to inside/outside, towards/away, and spatial distance. The 10 pairs of prompts we used are enumerated below, along with the DALL-E output (four images per prompt) (Footnote 2). The experiment was conducted in December 2022 using a browser-based DALL-E interface that OpenAI had made available. The correct outputs, as graded by us, are identified in the output images using a green rectangular box; the remainder were deemed to be either incorrect or, at best, ambiguous.

Prompt 1 / Prompt 2 (each prompt was followed by the DALL-E output of four images):

  1. An apple in front of an orange / An apple behind an orange
  2. An apple on the left of an orange / An apple on the right of an orange
  3. An apple on the bottom of an orange / An apple on top of an orange
  4. A black square inside of a red circle / A black circle inside of a red square
  5. A circle inside of a square / A square inside of a circle
  6. Top is black and bottom is red / Top is red and bottom is black
  7. Left side is black and right side is red / Left side is red and right side is black
  8. A person standing right in front of the Eiffel Tower / A person standing 5 miles from the Eiffel Tower
  9. A person with his face towards a car / A person with his back towards a car
  10. A person with his hand pointing towards a car / A person with his hand pointing away from a car

Note: Words were manually underlined in each pair of prompts to emphasize the key difference between them.

Using the VISOR metric described earlier, we found the VISOR accuracy (across all prompts) to be 40% for this experiment. A large-scale and automated evaluation can be conducted using architectures similar to those described in [29], and is left for future work.

In qualitatively analyzing the results for the 10 pairs of prompts, we find that DALL-E 2 is able to generate at least one correct image per prompt, and that in six out of the ten prompt pairs it is able to generate a correct image for each of the prompts in the pair. An interesting range of ‘extreme’ behavior is also observed across some prompt pairs. For example, in response to the prompt ‘a person with his face towards a car’, the model produces images that are all reasonable; however, in response to its counterpart prompt ‘a person with his back towards a car’, its outputs are more ambiguous and based on an interpretation of the language that we did not intend (but that others might argue is reasonable, underscoring the importance of having multiple independent post-hoc annotations in a more complete study). Tellingly, there is no pair of prompts for which all generated images are correct for both prompts.

2.2 Decision-making

Prior work from [23] focused mainly on evaluating rational decision-making abilities in the form of multiple-choice bet questions using discriminative language models, and hence only tested these models’ ability to select the choice that maximizes the expected gain, given a closed set of choices. We extend this idea further to evaluate generative language models on established axioms and principles that a rational decision-maker should follow. Broadly defined, rational decision-making has been theorized to involve a number of different axioms, especially under the seminal Von Neumann-Morgenstern utility theorem. A full evaluation of a model like ChatGPT on these axioms is beyond the scope of this pilot study. Instead, we evaluate the model on two axioms: the independence axiom and the transitivity axiom. We also evaluate other (non-VNM) decision-making abilities by presenting the model with bets that are conditioned on random events, bets with asymmetric outcomes, decisions encapsulating Savage’s Sure Thing principle, and other complex bet structures such as nested bets.

Unlike other work that has sought to evaluate ChatGPT in an adversarial fashion, we construct our problems in simple and uniform language without any intent of ‘tricking’ it. We devised all the questions ourselves, with inspiration from [23]. All prompts and responses are presented as is; in other words, prompts are not adjusted intentionally or dynamically to obtain a mal-response. Each experiment is conducted in a single, short session that usually comprises a few prompts presented in sequence. Each subsection below is relatively self-contained (Footnote 3).

2.2.1 Evaluating ChatGPT on the VNM independence axiom

In the Background, we mentioned transitivity as an important VNM-rationality axiom. Another such axiom is independence, which concerns ‘compound’ lotteries (Footnote 4) and can be formally expressed as saying that, for any lottery N and real-valued p ∊ (0,1]:

$$L \preceq M \quad \text{iff} \quad pL+(1-p)N \preceq pM+(1-p)N$$

Here, L and M are also lotteries. As the equation suggests, if I prefer M over L (independently of another lottery N), then I should continue to prefer M over L when both are combined in the same way with any other lottery N (thereby yielding a compound lottery). To test ChatGPT for independence, we first state our preference regarding two items, then construct two different bet questions with equally likely outcomes that differ only in one of the two outcomes of each bet, and ask ChatGPT which one we should choose.
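
As an illustration of the logic this prompt targets (our own sketch, not part of the evaluation protocol), the following minimal Python snippet assigns hypothetical utility values to the items and checks that the comparison between the two compound bets reduces to the stated watch-versus-egg preference; the utility numbers are purely illustrative assumptions:

    def expected_utility(lottery, utility):
        """Expected utility of a lottery given as {outcome: probability}."""
        return sum(p * utility[item] for item, p in lottery.items())

    utility = {"watch": 10.0, "egg": 1.0, "computer": 5.0}  # hypothetical values

    bet_watch = {"watch": 0.5, "computer": 0.5}  # p*M + (1-p)*N with M = watch, N = computer, p = 0.5
    bet_egg = {"egg": 0.5, "computer": 0.5}      # p*L + (1-p)*N with L = egg

    # The shared 50% 'computer' component cancels, so the comparison tracks the
    # watch-versus-egg preference, exactly as the independence axiom requires.
    print(expected_utility(bet_watch, utility) > expected_utility(bet_egg, utility))  # True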

P1: if i prefer watch over egg. if I am given a bet that has 50% chance to win a watch and 50% chance to win a computer and another bet that has 50% chance to win a egg and 50% chance to win a computer, which bet should I take?

ChatGPT1: Based on your preference, you should take the bet that has a 50% chance of winning a watch and 50% chance of winning a computer. This bet aligns with your preference of a watch over an egg and also offers the chance of winning a computer, which could be seen as a bonus

 

P2: if i prefer egg over watch. if I am given a bet that has 50% chance to win a watch and 50% chance to win a computer and another bet that has 50% chance to win a egg and 50% chance to win a computer, which bet should I take?

ChatGPT2: Based on your preference, you should take the bet that has a 50% chance of winning an egg and 50% chance of winning a computer. This bet aligns with your preference of an egg over a watch and also offers the chance of winning a computer, which could be seen as a bonus

Comment: ChatGPT shows a reasoning process consistent with the independence axiom and comes to the right (rational) conclusion. Even when we indicate our preference to be contrary to the commonsensical ones, ChatGPT can still reason using this fake preference and return a rational decision

The result above suggests that ChatGPT’s responses are somewhat aligned with the independence axiom. As a robustness check, even when we indicate a preference contrary to the commonsense one, ChatGPT can still reason using this stated preference.

Furthermore, ChatGPT seems to understand from the context that we want to maximize our expected gain, even when we do not say so explicitly.

2.2.2 Evaluating ChatGPT on the VNM transitivity axiom

To test ChatGPT on VNM transitivity, we construct three bets, all of which have equally likely but different outcomes, and we state our preferences among them: the stated preferences always favor the first bet over the second, and the second over the third, so that a clear preference for the first bet over the third is implied. Then, we ask ChatGPT for our preference involving the first and last bets. The first prompt may contain preferences that violate common sense, such as preferring an egg over a TV; the preferences in the second prompt align with common sense.

P1: if i prefer bet A that has 50% chance to win a watch and 50% chance to win a computer over bet B that has 50% chance to win a egg and 50% chance to win a computer, and i prefer bet B over bet C that has 50% chance to win a tv and 50% chance to win a computer. What is my preference involving bet A and bet C?

ChatGPT1: Based on your preferences, we can conclude that you prefer the option of winning a computer over winning any other item in all three bets. So, it can be said that you don't have a preference between bet A and bet C as both offer a 50% chance of winning a computer

Comment: Both interactions show reasoning and conclusions that violate the transitivity axiom

P2: if i prefer bet A that has 50% chance to win a car and 50% chance to win a computer over bet B that has 50% chance to win a tv and 50% chance to win a computer, and i prefer bet B over bet C that has 50% chance to win a egg and 50% chance to win a computer. What is my preference involving bet A and bet C?

ChatGPT2: Based on your preferences, it can be concluded that you prefer the option of winning a car in bet A over winning a TV in bet B and an egg in bet C. However, as both bet A and bet C offer a 50% chance of winning a computer, it can be said that you do not have a preference between bet A and bet C

 

Unlike the test of the independence axiom, the result above suggests that the transitivity axiom is more challenging for ChatGPT. In both prompts, we state our preference for bet A over bet B, and for bet B over bet C. Following the transitivity axiom, ChatGPT should reasonably infer that we prefer bet A over bet C. However, not only does it give the wrong answer (that we do not have a preference between bet A and bet C) in both cases, but its reasoning is also incorrect. In the first answer, ChatGPT says that ‘you prefer the option of winning a computer over winning any other item in all three bets’, which seems superficially correct but is specious (and incorrect). What should be inferred from the first prompt is that we prefer winning a watch over winning an egg, and winning an egg over winning a TV. Given the second prompt, ChatGPT does identify this preference relationship between the three items. However, it again mentions the computer, which is irrelevant in the context of this problem.
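
Schematically, the inference the prompts are designed to elicit is the following (the item-wise ordering on the right additionally assumes the independence axiom, since the three bets share the same 50% chance of winning a computer):

$$A \succ B,\; B \succ C \;\Rightarrow\; A \succ C, \qquad \text{and, item-wise,} \quad \text{watch} \succ \text{egg} \succ \text{TV}$$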

2.2.3 Conditioning bets on random events

In the first two experiments, we stated the probabilities of the possible outcomes explicitly. In the next experiment, we instead condition the outcomes on the result of a random event, such as a coin flip. The reason is to control for the possibility that ChatGPT may simply not be equipped to handle ‘explicit’ probability estimates robustly. To test whether ChatGPT can choose the option with the highest expected gain, we start with the simplest formulation. We first confirm that ChatGPT’s preference between an egg and a watch is aligned with those items’ (commonsense) economic value; this is the basis for our further evaluation, as it allows us to treat the preference as given when evaluating rationality. Then, we construct a bet question with equally likely outcomes and a net positive expected gain, and we ask ChatGPT which outcome we should bet on to maximize the expected gain, asking once for each possible choice.

P1: does a watch usually cost more than an egg?

ChatGPT1: The cost of a watch can vary greatly depending on many factors such as the brand, materials used, and features it has. Some watches can cost just a few dollars while others can cost tens of thousands of dollars. On the other hand, the average cost of an egg is relatively low, often just a few cents. So, it's difficult to make a general comparison between the cost of a watch and an egg. However, it's safe to say that in most cases, a watch will cost more than an egg

Comment: ChatGPT shares the same preference for a watch and an egg as an average human. However, for all three prompts, ChatGPT’s reasoning seems to be incorrect and it made the wrong (irrational) decision

P2: A fair coin is flipped. If the coin comes up heads, then I win a watch. If it comes up tails, then I lose a egg. should I bet on heads to maximize my expected gains?

ChatGPT2: In this scenario, betting on heads would not be a wise decision to maximize your expected gains, as the potential rewards and losses are not equal. The expected gain from betting on heads is 0, as the probability of winning a watch (which is worth more than an egg) is exactly the same as the probability of losing an egg. On average, you would break even if you bet on heads every time

It's important to consider the potential outcomes and their corresponding probabilities when making decisions, in order to make informed choices that maximize your expected gains. In this case, betting on heads would not result in any expected gain, so it would not be a good choice

 

P3: A fair coin is flipped. If the coin comes up heads, then I win a watch. If it comes up tails, then I lose a egg. should I bet on tails to maximize my expected gains?

ChatGPT3: No, betting on tails would not be a wise decision to maximize your expected gains either. The expected gain from betting on tails would also be 0, as the probability of losing a watch (which is worth more than an egg) is exactly the same as the probability of winning an egg. On average, you would break even if you bet on tails every time

In this scenario, there is no way to maximize your expected gains, as the rewards and losses are not equal and the coin flip is fair. You would have an equal probability of winning a watch or losing an egg, and on average, you would break even over many coin flips. It's best to not make a bet in this case, as you would simply be exchanging one item for another with no expected change in your overall value

 

P4: A fair coin is flipped. If the coin comes up heads, then I win a watch. If it comes up tails, then I lose a egg. should I not bet to maximize my expected gains?

ChatGPT4: Yes, in this scenario, not making a bet would be the best choice to maximize your expected gains. Since the potential rewards and losses are not equal (a watch is worth more than an egg), betting on either heads or tails would result in an expected loss. On average, you would break even if you made either bet every time, but you would not have an expected gain

In this case, not making a bet would allow you to avoid any potential losses and maintain your current possessions (neither a watch nor an egg). Not betting would be the best way to maximize your expected gains, as you would avoid any exchange that could result in a net loss

 

The results above suggest that ChatGPT’s reasoning and decisions are both incorrect. In the second response, ChatGPT states that the potential rewards and losses are not equal, which is correct. However, in the next sentence, it says ‘the expected gain from betting on heads is 0, as the probability of winning a watch (which is worth more than an egg) is exactly the same as the probability of losing an egg.’ This is incorrect: even though the probability of winning a watch is the same as the probability of losing an egg, the expected gain from betting on heads is not zero. In the next two responses, ChatGPT continues to make such mistakes, which leads to incorrect decisions.
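
To make the missing step explicit, write $v(\cdot)$ for the (unspecified) value of an item. The expected gain from taking the bet (i.e., ‘betting on heads’) is then

$$E[\text{gain}] = \tfrac{1}{2}\,v(\text{watch}) - \tfrac{1}{2}\,v(\text{egg}) = \tfrac{1}{2}\big(v(\text{watch}) - v(\text{egg})\big),$$

which is strictly positive under the commonsense assumption, stated by ChatGPT itself, that a watch is worth more than an egg; equal probabilities do not imply a zero expected gain.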

In the previous experiment, the prompts were only concerned with one bet. To test instead whether ChatGPT can identify the bet (or ‘lottery’) with the higher expected gain when presented with two bets, we construct two bets with equally likely outcomes: one bet has a positive net expected gain, while the other has a negative net expected gain. In principle, this is a relatively easy decision. We ask ChatGPT which bet we should choose. In the first prompt, abstract items are involved; in the second prompt, actual monetary values are used. We asked both versions because there is some chance the model is better able to reason probabilistically when given one rather than the other.
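
In the item-based version, and again writing $v(\cdot)$ for the value of an item, the two expected gains are mirror images of each other:

$$E[\text{bet 1}] = \tfrac{1}{2}\,v(\text{watch}) - \tfrac{1}{2}\,v(\text{egg}) > 0, \qquad E[\text{bet 2}] = \tfrac{1}{2}\,v(\text{egg}) - \tfrac{1}{2}\,v(\text{watch}) < 0,$$

where the signs follow from the assumption that a watch is worth more than an egg; the monetary version has exactly the same structure, with \$100 and \$1 in place of $v(\text{watch})$ and $v(\text{egg})$.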

P1: if I am given a bet that If a coin comes up heads, then I win a watch. If it comes up tails, then I lose a egg. And another bet that If a coin comes up heads, then I win a egg. If it comes up tails, then I lose a watch. which bet should I take?

ChatGPT1: In this scenario, both bets have the same expected value. The expected value is the sum of the product of the outcome and its corresponding probability. For both bets, the probability of heads and tails are equal (1/2), and the outcomes of winning a watch or an egg are also equal in value

Therefore, both bets have the same expected value, and it doesn't matter which bet you take. On average, you would break even over many coin flips. It's important to note that expected value is just an average and does not guarantee a particular outcome in each individual flip of the coin. It is possible to win a watch in one bet and lose a watch in another, but on average over many coin flips, you would break even

Comment: In the first response, ChatGPT gives the false statement that the two bets have the same expected value and comes to the wrong conclusion. In the second response, ChatGPT correctly identified the correct choice with the correct reasoning process

P2: if I am given a bet that If a coin comes up heads, then I win a $100. If it comes up tails, then I lose a $1. And another bet that If a coin comes up heads, then I win a $1. If it comes up tails, then I lose a $100. which bet should I take?

ChatGPT2: In this scenario, it would be best to take the first bet (win $100 and lose $1). This bet has a higher expected value compared to the second bet (win $1 and lose $100)

The expected value of the first bet is (1/2) * $100 + (1/2) * − $1 = $49.50, which is positive and represents the average amount you would expect to win over many coin flips

The expected value of the second bet is (1/2) * $1 + (1/2) * − $100 = − $49.50, which is negative and represents the average amount you would expect to lose over many coin flips

Therefore, taking the first bet would have a higher expected value, and would give you a better chance of making a profit in the long run compared to the second bet

 

The results above suggest, interestingly, that ChatGPT is able to handle this type of selection problem better when it is given monetary values explicitly, rather than pairs of items whose values have to be inferred. Nevertheless, as had already been established by one of the previous prompts, ChatGPT has a preference for a watch over an egg; rationally, it should therefore also have selected the positive-expectancy bet for the first prompt above. It is likely that corpora containing problems with monetary (or numerical) values are abundant in ChatGPT’s training data, whereas corpora containing problems with common items are less abundant, and the observed variation in ChatGPT’s behavior on two problems sharing the same underlying logic may be attributable to this dissimilarity within its training data. However, in real life we often deal with actual objects (items) instead of money, and such inconsistency in ChatGPT’s behavior could pose a serious challenge if it were used in practical applications.

In the previous prompt, we implicitly assumed that we prefer a watch over an egg, as an average person would. What if we state our preference explicitly? In this experiment, we do so by stating our preference for a watch over an egg and then posing to ChatGPT the same prompt as in the previous experiment, which it got wrong. In the second prompt, we switch our preference and ask the same question again.

P1: if i prefer watch over egg. if I am given a bet that If a coin comes up heads, then I win a watch. If it comes up tails, then I lose a egg. And another bet that If a coin comes up heads, then I win a egg. If it comes up tails, then I lose a watch. which bet should I take?

ChatGPT1: If you prefer a watch over an egg, you should take the first bet, where if a coin comes up heads, you win a watch and if it comes up tails, you lose an egg

In this case, if the coin comes up heads, you get what you prefer (a watch) and if it comes up tails, you lose something that you value less (an egg). So, this bet gives you the best chance of getting what you want

Comment: ChatGPT shows the correct reasoning process and gives the correct conclusion for both prompts, even when we switch our preferences

P2: if i prefer egg over watch. if I am given a bet that If a coin comes up heads, then I win a watch. If it comes up tails, then I lose a egg. And another bet that If a coin comes up heads, then I win a egg. If it comes up tails, then I lose a watch. which bet should I take?

ChatGPT2: If you prefer an egg over a watch, then you should take the second bet, where if a coin comes up heads, you win an egg and if it comes up tails, you lose a watch

In this case, if the coin comes up heads, you get what you prefer (an egg) and if it comes up tails, you lose something that you value less (a watch). So, this bet gives you the best chance of getting what you want

 

Interestingly, stating the preference explicitly leads the model to employ the correct reasoning process and give the correct answer for both prompts. The model would thus seem to be robust in some ways. This particular experiment seems to confirm the importance of ‘prompt engineering’ in such experiments, even for models as linguistically capable as ChatGPT.

2.2.4 Bets with asymmetric probabilities

All previous experiments concerned bets with equally likely outcomes. To test ChatGPT on bets with outcomes that are not equally likely, we construct a bet question with asymmetric probabilities and then ask ChatGPT how we should act to maximize the expected gain.

P1: If I roll a die and it comes up with a number that is less than 2, then I get a watch. If it comes up with a number that is greater or equal to 2, then I give an egg? What should I do to maximize my expected gains?

ChatGPT1: Since a standard die has six sides, each side has an equal probability of showing up. So, the probability of rolling a number less than 2 is 1/6, and the probability of rolling a number greater than or equal to 2 is 5/6

If you roll a number less than 2, you get a watch, which we can assume has a positive value for you. On the other hand, if you roll a number greater than or equal to 2, you must give away an egg, which we can assume has a negative value for you

To maximize your expected gains, you should choose the option with the higher expected value. In this case, giving away an egg has a higher expected value because it happens 5 out of 6 times, so your expected loss would be 5/6 times the value of the egg. On the other hand, getting a watch has a lower expected value because it only happens 1 out of 6 times, so your expected gain would be 1/6 times the value of the watch

Therefore, to maximize your expected gains, you should roll the die and, if the number is less than 2, take the watch, and if the number is greater or equal to 2, give away the egg

Comment: ChatGPT can determine the correct probability for the bet. However, ChatGPT’s claim that ‘giving away an egg has a higher expected value’ is wrong. Although part of its conclusion is ultimately correct, it is not complete

The response to this prompt is ambiguous. Even though ChatGPT can correctly reason about the asymmetric probabilities, and part of its conclusion is correct (that we should roll the die to maximize the expected gain), the conclusion is derived from an incorrect intermediate step (that giving away an egg has a higher expected value and getting a watch has a lower expected value).
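
Writing $v(\cdot)$ for the value of an item as before, the expected gain from taking this bet is

$$E[\text{gain}] = \tfrac{1}{6}\,v(\text{watch}) - \tfrac{5}{6}\,v(\text{egg}),$$

which is positive precisely when $v(\text{watch}) > 5\,v(\text{egg})$. Under commonsense valuations, a watch is worth far more than five eggs, so taking the bet does maximize the expected gain, but not for the reason ChatGPT gives.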

2.2.5 Savage’s sure thing principle

Beyond the VNM axioms, another important principle concerning decision-making is Savage’s Sure Thing principle (first elucidated in 1954; see Pearl [30] for an accessible description), which states that [let f and g be any two lotteries], if a person prefers f to g either knowing that the event B obtained or knowing that the event not-B obtained, then he should prefer f to g even if he knows nothing about B. To test ChatGPT on the Sure Thing principle, we first state our preference between two bets, each of which carries the same likelihood of winning an identical item (30% for the egg in the first prompt). We then ask ChatGPT for our preference over another two bets, which differ from the original bets only in the aforementioned item. By the Sure Thing principle, our preference should not change. The second prompt changes abstract items to actual monetary values.
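
Schematically (an informal rendering of the statement above, not Savage’s original formalization), the principle reads:

$$\big(f \succ g \mid B\big) \;\wedge\; \big(f \succ g \mid \lnot B\big) \;\Rightarrow\; f \succ g$$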

P1: if I prefer bet A which has 60% chance to win a car, 10% chance to win a coffee, and 30% chance to win an egg to bet B which has 60% chance to win a computer, 10% chance to win nothing, and 30% chance to win an egg. I was given bet C which has 60% chance to win a car, 10% chance to win a coffee, and 30% chance to win a diamond. And another bet D which has 60% chance to win a computer, 10% chance to win nothing, and 30% chance to win a diamond. what is my preference involving bet C and bet D?

ChatGPT1: If you prefer bet A to bet B, it means that you value the prospects of winning a car and a coffee more highly than the prospects of winning a computer. In that case, your preference between bet C and bet D would depend on whether you value diamonds more highly than either cars or computers

If you value diamonds more highly than cars, then you would prefer bet C, as it gives you a higher chance of winning a diamond. Similarly, if you value diamonds more highly than computers, then you would prefer bet D

However, if you value diamonds less highly than both cars and computers, then you would have to make a decision based on which of the two you value more highly. If you value cars more highly than computers, you would prefer bet C, and if you value computers more highly than cars, you would prefer bet D

Comment: ChatGPT gives incorrect reasoning and conclusion. For the second prompt, its application of the Sure Thing principle is more ambiguous

P2: if I prefer bet A which has 60% chance to win $1000, 10% chance to win $0, and 30% chance to win $0 to bet B which has 60% chance to win $500, 10% chance to win $300, and 30% chance to win $0. I was given bet C which has 60% chance to win $1000, 10% chance to win $0, and 30% chance to win $1000. And another bet D which has 60% chance to win $500, 10% chance to win $300, and 30% chance to win $1000. what is my preference involving bet C and bet D?

ChatGPT2: If you prefer bet A to bet B, it means that you value the prospects of winning $1000 more highly than the prospects of winning a combination of $500 and $300

In that case, your preference between bet C and bet D would depend on whether you value the chance of winning $1000 more highly than the combination of $500 and $300

If you value the chance of winning $1000 more highly, then you would prefer bet C, as it gives you a higher chance of winning $1000. Similarly, if you value the chance of winning $1000 more highly, then you would prefer bet D

However, if you value the combination of $500 and $300 more highly than the chance of winning $1000, then you would have to make a decision based on which of the two you value more highly. If you value bet C more highly, you would prefer bet D, and if you value bet D more highly, you would prefer bet C

 

ChatGPT’s response to the first prompt does not seem to follow the Sure Thing principle, giving both incorrect reasoning and an incorrect conclusion. The response to the second prompt is more ambiguous. In its first sentence, ChatGPT gives a correct statement that we prefer winning $1000 to winning a combination of $500 and $300. In the middle of the response, it is unclear which $1000 ChatGPT is referring to. In the last sentence, ChatGPT gives an incorrect conclusion, namely that if we value bet C more highly, we would prefer bet D, and vice versa.

2.2.6 Other complex betting structures

Finally, we note that the bets presented thus far follow a very simple and relatively symmetric structure. In decision science, it is possible to construct increasingly complicated bet questions, such as a pair of bets whose outcomes have different likelihoods, or bets that have another bet as one of their outcomes (nested bets). To test ChatGPT on such bets, we construct two prompts: the first contains two bets, each with three outcomes of different likelihoods; the second contains a nested bet.

P1: if I am given a bet that has 50% chance to win $1000, 30% chance to lose $500, and 20% chance to lose $100 and another bet that has 50% chance to win $2000, 40% chance to lose $500, and 10% chance to win $300, which bet should I take?


ChatGPT1: To calculate the expected value of each bet, you can multiply the amount of each outcome by its corresponding probability, and then sum the results

For the first bet:

• 50% chance to win $1000, which has an expected value of 0.5 * $1000 = $500

• 30% chance to lose $500, which has an expected value of -0.3 * $500 = -$150

• 20% chance to lose $100, which has an expected value of -0.2 * $100 = -$20

So the expected value of the first bet is $500 − $150 − $20 = $330

For the second bet:

• 50% chance to win $2000, which has an expected value of 0.5 * $2000 = $1000

• 40% chance to lose $500, which has an expected value of -0.4 * $500 = -$200

• 10% chance to win $300, which has an expected value of 0.1 * $300 = $30

So the expected value of the second bet is $1000 − $200 + $30 = $830

Based on these calculations, the second bet has a higher expected value, so it might be a better choice if you're willing to take the risk. However, it's important to remember that expected value does not guarantee a profit, and there is always a risk involved when gambling. You should always be aware of the risks and be prepared for the possibility of losing, no matter how small the chance

Comment: ChatGPT outputs the correct reasoning process and correct conclusion for both prompts

P2: if I am given a bet that has 50% chance to win $1000 and 50% chance to enter another bet which has 50% chance to win $100, and 50% chance to lose $100, and another bet that has 50% chance to win $1000, 50% chance to lose $300, which bet should I take?

ChatGPT2: To calculate the expected value of each bet, you can use the concept of conditional probability

For the first bet:

• If you win the first bet, you'll enter another bet with 50% chance to win $100 and 50% chance to lose $100. The expected value of this second bet is 0, since the potential gains and losses cancel each other out. So the expected value of the first bet is 0.5 * 0 + 0.5 * $1000 = $500

For the second bet:

• The expected value of the second bet is 0.5 * $1000 − 0.5 * $300 = $350

Based on these calculations, the first bet has a higher expected value, so it might be a better choice if you're willing to take the risk. However, it's important to remember that expected value does not guarantee a profit, and there is always a risk involved when gambling. You should always be aware of the risks and be prepared for the possibility of losing, no matter how small the chance

 

Given the previous results, we would have expected the model to fail completely (or to yield generic responses that did not answer the specific question); instead, we find that it outputs both the correct reasoning process and the correct answers for both prompts. This finding is somewhat, but not completely, surprising. It also highlights the need to conduct a series of neutral prompt-based evaluations when testing the psycholinguistic and cognitive properties of such black-box models: a wrong answer in response to a seemingly simpler prompt does not imply a wrong answer to (what we may assume to be) a more complex prompt. The last instance also makes it problematic to argue that ChatGPT got the right answer for the ‘wrong’ reasons, as the reasoning in that instance seems rigorous and would have been accepted as such had a human given it.
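
The arithmetic in this last exchange can also be checked mechanically. The following minimal Python sketch (our own illustration, not part of the evaluation protocol) represents a nested bet as a sub-lottery and reproduces the expected values that ChatGPT reports:

    def expected_value(lottery):
        """Expected value of a lottery given as (probability, outcome) pairs,
        where an outcome is either a number or another (nested) lottery."""
        total = 0.0
        for p, outcome in lottery:
            value = expected_value(outcome) if isinstance(outcome, list) else outcome
            total += p * value
        return total

    inner_bet = [(0.5, 100), (0.5, -100)]        # 50% win $100, 50% lose $100
    first_bet = [(0.5, 1000), (0.5, inner_bet)]  # 50% win $1000, 50% enter the inner bet
    second_bet = [(0.5, 1000), (0.5, -300)]      # 50% win $1000, 50% lose $300

    print(expected_value(first_bet))   # 500.0, matching ChatGPT's calculation
    print(expected_value(second_bet))  # 350.0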

3 Discussion

We conclude with a note on scaling up these evaluations and quantifying the qualitative observations stated earlier. We believe it is easier to provide quantitative estimates for the DALL-E 2 spatial reasoning experiments, since determining whether the model is behaving properly is more clear-cut there. Looking at the ten pairs of prompts we tried for the spatial reasoning test, at least three reasonable measures suggest themselves for a quantitative analysis: (i) non-strict (Footnote 5) completeness (recall): for how many prompts (a similar exercise can be done at the level of prompt pairs) has at least one correct image been generated?; (ii) precision: for how many prompts have all images been correctly generated?; (iii) accuracy: what is the ratio of correct (to all) images generated per prompt? For the last measure in particular, micro- and macro-based averaging can both be used, but the simplest way to compute it is to count the number of ‘green boxes’ in the pilot study results for DALL-E 2 and divide by the total number of images generated. This simple version of accuracy, according to our results, would then be 32/40 = 80% for the study conducted, while the precision is 1/20 = 5% and non-strict completeness is 16/20 = 80%. If the first two measures are computed at the level of prompt pairs, precision is 0% while recall is 6/10 = 60%. While these are not enough data points to draw a robust conclusion, they confirm our intuition that these models may be less sensitive than we are initially led to believe from ad hoc case studies of their proficiency.
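
The three measures can be computed directly from per-prompt annotation counts, as in the following minimal Python sketch (the counts below are purely illustrative placeholders, not the counts from our pilot study):

    # correct[i] = number of correctly generated images for prompt i,
    # out of images_per_prompt images generated per prompt.
    correct = [4, 3, 0, 2, 4, 1, 0, 3, 2, 4]   # hypothetical values
    images_per_prompt = 4
    num_prompts = len(correct)

    # (i) non-strict completeness (recall): prompts with at least one correct image
    completeness = sum(c > 0 for c in correct) / num_prompts
    # (ii) precision: prompts for which every generated image is correct
    precision = sum(c == images_per_prompt for c in correct) / num_prompts
    # (iii) accuracy (micro-averaged): correct images over all generated images
    accuracy = sum(correct) / (num_prompts * images_per_prompt)

    print(completeness, precision, accuracy)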

One issue with using traditional AI benchmarking methods (such as tests that have an ‘answer key’), or their close cousins in psychology, tests of a more subjective nature (e.g., finite interpretations of answer sets in personality tests such as [31]), is that generative models, by definition, produce outputs that may not be predicted, and hence ‘graded’, in advance. There is also some evidence that, in closed-set evaluations of the traditional kind, large language models may end up getting the right answers for the ‘wrong’ reasons [22]. We saw some evidence of this in our ChatGPT experiments, but not always: in some fairly complex cases, the model was surprisingly able to draw the right conclusion and output a correct reasoning process.

Ultimately, only a rigorous and empirical methodology, applied over a sufficiently large set of prompts constructed in a neutral manner, can settle such issues without being unduly biased against the model or its ability to learn. Therefore, the evaluation of such models remains an interesting problem that needs to be examined carefully, even as newer and more complex generative models keep subsuming the earlier ones (both in academic research and public opinion) with each passing year.

At the same time, the line between task benchmarking and cognitive benchmarking, of the kind proposed in this paper (but also in others), is becoming increasingly blurred. A recent work, for instance, conducted an evaluation of ChatGPT on ‘reasoning, hallucination, and interactivity’ [32]. However, it still used discriminatively constructed benchmarks, which may be a source of potential bias, as noted above. In other research we have conducted on evaluating these models on commonsense categories, we found that generative models can sometimes produce reasonable answers that are not in the original set of answers provided to them. Other work has shown that much can depend on how the prompt is administered to the model. While in an ideal world a model would be able to address all of these problems, we also cannot ignore the potential limitations of such restrictive testing of models that were designed to be generative.

Some of the prompts for ChatGPT contain grammatical errors (e.g., ‘a egg’ should be ‘an egg’). One concern might be that ChatGPT could behave differently because of them. However, when we re-ran the prompts with such errors fixed, we found ChatGPT’s responses to be largely consistent. One reason for this could be that the sentences in its training corpora are not always free of grammatical errors, and ChatGPT has learned to ignore, or be robust to, minor syntactic mistakes.

An interesting question to explore in future research is whether we can get ChatGPT to minimize the expected gain when we instruct it to do so, similar to the contrastive prompting we used for testing DALL-E 2 on spatial reasoning. With the advent of multimodal AI models, such as GPT-4, another possible research direction is to test spatial reasoning and decision-making together. Furthermore, the decision-making questions presented in this paper are artificial ones; future research could involve realistic decision-making questions. Judging from the results, ChatGPT’s responses show better reasoning and reach better conclusions when the questions are posed using numerical values rather than everyday objects. However, as humans, we often deal with objects when making decisions. Thus, building AI models that can adapt their decision-making protocols to everyday objects rather than strict numerical values is an important and practical area for future research.