1 Introduction

The soundness of the TT as a test for intelligence has long been debated. Hernández-Orallo (2017), among others, argues that:

The standard Turing test is not a valid and reliable test for HLMI [Human Level Machine Intelligence].[…] the Turing test aims at a quality and not a quantity. Even if judges can give scores, in the end any score of humanness is meaningless. (129)

My view is that the fault of the TT is one of interpretation and experimental design rather than experimental concept. To show this, I propose a new version of the TT, called QTT. In the QTT, the entityFootnote 1 must accomplish a yes/no enquiry in a humanlike and strategic way, where ‘strategic’ means with as few questions as possible.Footnote 2 My claim is that the QTT (i) improves the experimental design of the TT, by minimising both the Eliza EffectFootnote 3 and the Confederate EffectFootnote 4; and (ii) prevents both Artificial StupidityFootnote 5 and BlockheadFootnote 6 from passing.

The rest of the paper is structured as follows. In the next section, I review two interpretations of the TT: the Original Imitation Game (OIG), advocated by Sterrett (2000); and the Standard Turing Test (STT), advocated by Moor (2001). In Sect. 3, I discuss two problems with the TT: (i) Artificial Stupidity and (ii) Blockhead. In Sect. 4, I introduce the QTT, describe my study, and show the results gained. Finally, in Sect. 5, I consider three possible objections to the QTT.

2 Interpretations of the Turing Test

In this section, I review two different interpretations of the TT: (i) the Literal Interpretation, endorsed by the Original Imitation Game (Sterrett 2000); and (ii) the Standard Interpretation, endorsed by the Standard Turing Test (Moor 2001). The former holds that the results of the TT are given by the comparison between the human’s performance and the machine’s performance; and the latter holds that the results are given directly by the judge’s decision, with no benchmark or comparison needed. I advocate the Literal Interpretation as the proper one, and I use the experimental design of the OIG as the experimental design of the QTT.

2.1 Literal Interpretation (OIG)

The Original Imitation Game (OIG) is based on the first formulation of the test given by Turing (1950), and it involves two phases. The first phase is played by A (man), B (woman) and C (the judge): here, C asks questions to A and B in order to identify the woman. The second phase, introduced by the question “What will happen when a machine takes the part of A in this game?”,Footnote 7 is played in the same way by M (machine), B (woman) and C (the judge). If C decides “wrongly as often […] as [C] does when the game is played between a man and a woman”,Footnote 8 then M passes the test. In other words, M passes if it is identified as B in the second phase as frequently as A is identified as B in the first phase (see Fig. 1).Footnote 9

Fig. 1 Shows the two phases of the OIG

Sterrett (2000) holds that the OIG provides the appropriate experimental design to test for intelligence. This is because, in the OIG, the results are given by the comparison between (i) the frequency with which C misidentifies A and (ii) the frequency with which C misidentifies M. Moreover, the OIG focuses on a specific notion of machine intelligence: since both A and M have a task, namely to imitate B, the OIG evaluates the resourcefulness of the machine in performing a task, compared to the resourcefulness of the human in performing the same task. So, Sterrett (2000) concludes, the OIG:

[…] constructs a benchmark of intellectual skill by drawing out a man’s ability to be aware of the genderedness of his linguistic responses in conversation. (550)

Sterrett’s interpretation has been criticised as leading to a gender-oriented test for intelligence. And it is frequently pointed out that Turing was interested in the imitation of a human mind,Footnote 10 not a male or a female one.Footnote 11 A careful reading of Sterrett (2000), however, reveals that she does not intend cross-gendering to be a necessary implementation of the OIG. On the contrary, she agrees that:

[…] cross-gendering is not essential to the test; some other aspect of human life might well serve in constructing a test that requires such self-conscious critique of one’s ingrained responses. The significance of the cross-gendering in Turing’s Original Imitation Game Test lies in the self-conscious critique of one’s ingrained cognitive responses it requires. (550–51)

This appears to be compatible with Traiger (2000), who holds that:

"A" and "B" could be placeholders for whatever characteristics may be used in different versions of the game. Turing’s formulation invites generalization. (565)

2.2 Standard Interpretation (STT)

The Standard Turing Test (STT) is based on the second formulation of the test given by Turing (1950), which is introduced by the following question: can a computer, given enough storage and speed, “be made to play satisfactorily the part of A in the imitation game, the part of B being taken by a man?”Footnote 12 The STT involves a single phase, and it is played by M (machine), B (human) and C (the judge): here, C asks questions to M and B in order to identify the human (see Fig. 2).Footnote 13 M passes if C cannot tell the difference, and no comparison is needed—or available.

Fig. 2 Shows the only phase in the STT

According to the Standard Interpretation, the first phase of the TT is introductory. Moor (2001) argues that only the second phase, where the contestants are a human and a machine, matters. The first phase, involving a man and a woman, “is at most an intermediary step toward the more generalized game involving human imitation.”Footnote 14 Similarly, Shah and Warwick (2010) argue that:

[…] Turing merely introduced the human-only (man-woman) imitation game initially to draw the reader in, and through the text, lay the foundation for acceptance of a machine to compete, pitted against a human comparator, in a form and competition in which humans are vastly different and successful at from other species: language. (451)

The problem is that the STT not only lacks a comparative measure and relies solely on C’s judgement, but also exonerates B from any taskFootnote 15: only M has to imitate a human; B does not have to make any intellectual effort. Whereas the OIG “compares the abilities of man and machine to do something that requires resourcefulness of each of them”,Footnote 16 the STT evaluates only the machine, not the human. Due to this unfairness, I agree with Sterrett (2000) that the STT “[…] is just too sensitive to the skill of the interrogator to even be regarded as a test.”Footnote 17

3 Problems with the TT

In this section, I discuss two problems with the TT: (i) Artificial Stupidity,Footnote 18 which refers to the use of uncooperative, but humanlike, responses to evade any possible interaction during the TT; and (ii) Blockhead, the logically possible look-up table that “can produce a sensible sequence of verbal responses to a sequence of verbal stimuli, whatever they may be.”Footnote 19 I argue that both the OIG and the STT are affected by these two problems. First, they cannot prevent the entities from being uncooperative and evasive. This is true for the OIG, where there are no objectively right or wrong things to say to impersonate effectively; and even more so for the STT, where there isn’t any real task to accomplish. And second, they cannot rule out the possibility that every input the machine receives is paired with an output by brute force.

3.1 Artificial Stupidity

Artificial Stupidity is a possible strategy to exploit the experimental design of the TT, both OIG and STT. The reason why Artificial Stupidity works, I argue, is that the TT parametrises the entity along ‘human-likeness’ alone—or, better, ‘B-likeness’ (where ‘B’ is a woman in the OIG, and a human in the STT). With no other dimensions along which to parametrise the entity, it does not really matter what the entity says, as long as it is humanlike enough. In the TT, in other words, all that matters is the style of the entity’s interactions. Artificial Stupidity is not to be confused with Artificial Fallibility, which refers to the cognitive boundaries that a machine should show in order to be attributed with ‘human-likeness’, although the boundary between the two is very thin. To clarify this distinction, I show two examples provided by Turing (1950). The first involves Artificial Fallibility:

Q: Add 34,957 to 70,764.

A: (Pause about 30 s and then give as answer) 105,621. (434)

It is worth noting that the entity takes a relatively long time to provide the response, and the result is incorrect (the right one is 105,721). This is a plausible outcome for an average human. Now, let’s suppose that the entity is a machine: if it replies correctly and too quickly, then it “would be unmasked because of its deadly accuracy.”Footnote 20 However, as Turing specifies, the machine “would not attempt to give the right answers to the arithmetic problems. It would deliberately introduce mistakes […].”Footnote 21 The second example is more subtle and involves a mix of Artificial Fallibility and Artificial Stupidity:

Q: Please write me a sonnet on the subject of the Forth Bridge.

A: Count me out on this one. I never could write poetry. (ibid.)

Here, the reply is neither right nor wrong, but simply uncooperative (Artificial Stupidity). Few humans could knock off a sonnet during a conversation, and so such uncooperativeness is understandable (Artificial Fallibility). This reveals, however, a crucial flaw in the TT: the conflation of ‘human-likeness’ with what I call ‘correctness’. To generalise, for every request from the judge, the entity can always reply with something evasive like “I’m not in the mood today, let’s talk about something else”. So, I argue, in a fixed-length TTFootnote 22 it is not possible to discriminate between humanlike intelligence and humanlike stupidity, due to the possibility for the entity to give an uncooperative, but humanlike, reply.Footnote 23

I define a reply as uncooperative when it breaks the Cooperative PrincipleFootnote 24 proposed by Grice (1975). In other words, a reply is uncooperative when it evades the question. Artificial Stupidity exploits Turing’s concession that the “machine would be permitted all sorts of tricks so as to appear more man-like […].”Footnote 25 Because of this, Artificial Stupidity is arguably the most versatile strategy to pass the TT, since it can potentially evade any interaction whatsoever without giving ‘human-likeness’ away (for a human could plausibly give uncooperative replies as well). So, given the experimental design of the TT, the entity does not really need to hold a conversation like a human: it just needs to evade a conversation like a human.

The problem of deception has been addressed by Levesque (2011, 2012), who proposes a variation of the TT called the Winograd Schema Challenge (WSC), where the participants have to identify the antecedent of an ambiguous pronoun in a statement (a classic example: deciding what ‘it’ refers to in “The trophy doesn’t fit in the suitcase because it is too big”), showing not only natural language processing but also the use of common sense.

3.2 Blockhead

Blockhead is a thought experiment designed by Block (1981), but already popular in the 1950s.Footnote 26 It is intended to show that the TT allows the logical possibility of an unintelligent entity, built with a hand-coded table of appropriate verbal responses to a variety of verbal stimuli, whatever they may be, being attributed with intelligence. Apart from being physically unfeasible, there are two problems with Blockhead. (i) It would not possess any algorithm to adapt to different conversational circumstances, meaning that Blockhead cannot perform any conversational task other than pairing an input with an output by brute force. And (ii) it would not possess any algorithm to optimise the search through its table, meaning that it could potentially take a very long time to emit a response.Footnote 27 So, Blockhead may seem ruled out as a viable approach to passing the TT. However, despite being only a logical possibility, and despite its potential slowness in producing a reply, I argue that Blockhead still represents a weakness in the experimental design of the TT. This is because of—at least—three cases, which I argue to be both logically possible and physically feasible: (i) Expert Blockhead; (ii) Stupid Blockhead; and (iii) Learning Blockhead. These cases, it’s worth noting, make the full Blockhead redundant. A toy sketch of the underlying look-up mechanism follows.
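
The following is a deliberately crude sketch (mine, not Block’s) of the mechanism: every reply is a brute-force pairing of the whole conversation history with a canned output, with no algorithm for adaptation (problem (i)) and no index to speed up the search (problem (ii)).

# Toy Blockhead: a hand-coded table keyed on the entire conversation
# history. Purely illustrative; the real thought experiment posits a
# table covering every possible finite conversation.
TABLE = {
    ("Hello!",): "Oh, hi.",
    ("Hello!", "Oh, hi.", "Can you write me a sonnet?"): "Count me out on this one.",
}

def blockhead_reply(history: tuple) -> str:
    # Problem (ii): with no search-optimising algorithm, Blockhead can
    # only scan its table entry by entry, however large the table grows.
    for stimuli, response in TABLE.items():
        if stimuli == history:
            return response
    # Problem (i): outside the hand-coded pairs, no adaptation is possible.
    return ""

print(blockhead_reply(("Hello!",)))  # -> "Oh, hi."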

3.2.1 Expert Blockhead

Expert Blockhead has an incomplete, hand-coded table of cooperative verbal responses. Its table is adequate to accomplish a specific task and to work reasonably fast. However, Expert Blockhead sacrifices its ‘human-likeness’, since its table is too small to include every possible humanlike response.

3.2.2 Stupid Blockhead

Stupid Blockhead has an incomplete, hand-coded table of uncooperative verbal responses. Its table is not adequate to accomplish any task other than evading topics in a humanlike fashion, but it can work reasonably fast. As Block (1981) remarks,Footnote 28 Stupid Blockhead works like Eliza,Footnote 29 and it can pass the TT by exploiting the judge’s beliefs during the conversation.

3.2.3 Learning Blockhead

Learning Blockhead independently learns its table (e.g. by scouring the internet and memorising any verbal interactions it finds), and it can produce many appropriate responses in a humanlike fashion. While “the whole point of the machine [Blockhead] is to substitute memory for intelligence,”Footnote 30 the whole point of Learning Blockhead is to substitute memorisation for learning. Even though it could still potentially take too long to emit a reply, Learning Blockhead is not as slow as Blockhead, and it might be attributed with ‘human-likeness’.

4 The Questioning Turing Test

As the name suggests, the QTT is focused on a specific kind of conversation, that is, enquiries. The QTT is intended to evaluate the candidate entity for the ability to accomplish a yes/no enquiry with as few humanlike questions as possible. The aim of the enquiry in the QTT can vary, and different versions can be designed. This means that the judge, unlike in the TT, can be either an average person or an expert. An example is the First Aid QTT, where the entity takes the medical history of a patient (judge), and its performance is scored against the performance of a real doctor. Or the Detective QTT, where the entity interrogates the suspect (judge), and its performance is scored against the performance of a real detective. In the context of the extended QTTs, where the enquiry is not limited to 20 yes/no questions but open to a full natural language enquiry, ‘strategicness’ might be intended as the ability to ask as many questions as necessary, depending on the nature of the enquiry. For instance, an enquiry about an unknown chess variation, or a difficult scientific problem, might require a lot of time and an extensive and exhaustive exploration of every possibility. In general, however, it can be argued that a strategic enquiry always involves fewer questions, avoiding redundant ones.

In this section, I discuss: (i) the switch in my experimental design; (ii) the viva voce setup; (iii) the experiment involved in my study; and (iv) the results gained so far.

4.1 From SISO to SOSI

The TT’s text-based exchange can be defined as “symbols-in, symbols-out”Footnote 31 (SISO). This means that, in the TT, the entity usually needs to receive some interaction from the judge in order to emit a response. The TT, in other words, is a test for stimulus–response systemsFootnote 32 which, according to McKinstry (2006), can:

[…] respond in a perfectly humanlike fashion to previously anticipated stimuli and an approximately humanlike fashion to unanticipated stimuli, but they are incapable of generating original stimuli themselves. (296)

The SISO model can be thus considered responsible for many false positivesFootnote 33 in the TT. In order to avoid this, in the QTT I introduce the switch (see Fig. 3)Footnote 34 from the SISO model to its reverse model, which I call ‘symbols-out, symbols-in’ (SOSI).

Fig. 3 Shows the design of a SISO test and a SOSI test

The switch from SISO to SOSI allows the QTT to parametrise the entity along three dimensions: (i) ‘human-likeness’, attributed by the judge just like in the TT; (ii) ‘correctness’, which evaluates whether the entity accomplishes the yes/no enquiry; and (iii) ‘strategicness’, which evaluates how well the entity accomplishes the enquiry, in terms of the number of questions asked (the fewer, the better). ‘Human-likeness’ is intended to set the average bar of success, and to prevent an oracleFootnote 35 entity from passing. ‘Correctness’ is intended to prevent Artificial Stupidity from passing by evading the conversation in an uncooperative but humanlike way. And ‘strategicness’ is intended to prevent Blockhead (as well as Expert, Stupid and Learning Blockhead) from passing by producing verbal interactions by means of a (potentially very long-lasting) brute-force search. ‘Correctness’, like Levesque’s (2011, 2012) WSC, is intended to prevent the problem of deception; unlike the WSC, however, the QTT does not require any particular competence from the judges, thus avoiding potential chauvinism.
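
For concreteness, a session can be recorded along the three dimensions roughly as follows. This is a purely illustrative sketch (mine): the judges give their verdicts directly, and the paper commits to no numeric formula for ‘strategicness’ beyond “the fewer questions, the better”, so the linear scale below is an assumption.

from dataclasses import dataclass

@dataclass
class QTTSession:
    judged_human: bool          # 'human-likeness': the judge's verdict
    enquiry_accomplished: bool  # 'correctness': was the aim of the enquiry met?
    questions_asked: int        # basis for 'strategicness'

def strategicness(session: QTTSession, limit: int = 20) -> float:
    # Illustrative score in [0, 1]: a failed enquiry scores zero; an
    # enquiry settled in a single question scores one.
    if not session.enquiry_accomplished:
        return 0.0
    return max(0.0, (limit - session.questions_asked) / (limit - 1))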

A further advantage of the switch from SISO to SOSI is that it allows a hybrid version of the QTT to be played. Hybrid systems have attracted growing interest in recent years, and they can be described as systems in which humans and machines work together,Footnote 36 performing better together than either would individually.Footnote 37 In the Hybrid QTT, the role of the entity is played by both a human and a machine, cooperating to accomplish the enquiry with as few humanlike yes/no questions as possible. This, I argue, is an important advantage over the TT, where there would be little point in machine/human cooperation. To generalise, SISO games are competitive ones, whereas SOSI games can be either competitive or cooperative. The QTT, given its experimental design and its focus on enquiries rather than open conversations, provides a viable setting for cooperation, and not only competition, between entities. In this paper, the Hybrid QTT is not discussed in detail, and its potential implications are not explored in full; it does, however, provide an interesting basis for future work.

Summing up, the SISO approach always requires the judge to speak first, or ask a question first (“symbols in” can be rephrased as “ask a question to the entity”, and “symbols out” as “wait for the entity’s reply”). The SOSI approach allows the entity to speak first, or ask a question first (“symbols out” can be rephrased as “let the entity ask a question”, and “symbols in” as “wait for the judge’s reply”). In general, it can be said that the SISO approach (the judge asks a series of questions and the entity replies) is a competitive one. In contrast, the SOSI approach (the entity asks a series of questions and the judge replies) can be either competitive or cooperative.
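
The two protocols can be summarised as two turn-taking rounds. A minimal sketch (mine; the callables are placeholders for whichever party plays each role):

def siso_round(judge_asks, entity_replies):
    question = judge_asks()          # symbols in: the judge speaks first
    return entity_replies(question)  # symbols out: the entity responds

def sosi_round(entity_asks, judge_replies):
    question = entity_asks()         # symbols out: the entity speaks first
    return judge_replies(question)   # symbols in: the judge answers (e.g. yes/no)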

4.2 Viva Voce

Apart from the parallel-paired TT, where there are always three participants (A, B and C), there is another possible setup of the TT. This is the one-to-one test, where A (entity) and C (the judge) have a text-based conversation, and C evaluates A’s performance. Turing (1950) calls it viva voce:

The [imitation] game (with the player B omitted) is frequently used in practice under the name of viva voce […]. (446)

Since the TTs that I run are limited to three questions, and the QTT and Hybrid QTT are limited to twenty yes/no questions, I use the viva voce setup for the experiment. In the case of open-ended TTs and QTTs, the proper approach would be the parallel-paired one. The viva voce setup also has the advantage of making the experiment as simple and as quick as possible. This choice has two justifications: the first, practical, is to minimise the potential chauvinistic consequences that an open conversation might generate; the second is to optimise the resources and location of the experiment, which I run during a three-day event at the National Museum of Scotland, with participants ranging from children to adults.

It is worth recalling, however, that the viva voce QTT is still made of two procedures: one involves a human, and the other a machine. The results of the QTT do not rely solely on the judge’s decisions (as in the STT), but on the comparison between the performances of the entities. In this regard, the QTT endorses the OIG as the proper experimental design of the TT.

Below (see Fig. 4),Footnote 38 I show the two procedures of the viva voce QTT, the human-questioning-human (HqH) and the machine-questioning-human (MqH), where C thinks of a public figure, and A and B try to guess whom by asking yes/no questions:

Fig. 4 Shows the two procedures of the QTT

4.3 The Experiment

The experiment is part of my PhD research, and it is not available online for readers or testers. The participants (that is, the judges) of the experiment are volunteers (children and adults, female and male) during a series of events at the National Museum of Scotland, Edinburgh.

Each experiment is divided into four tests:

(i) The first is a three-question TT, where the judge is asked, at each interaction, to rate on a scale of 0–10 how human the entity is.

(ii) The second test, which I call TT2, asks the judge to come up with three bias-free puzzles (e.g. alphanumeric riddles). At each interaction, the judge is asked to rate, on a scale of 0–10, how correct the entity’s reply is, and how human the entity is.

(iii) The third test is a twenty-question yes/no QTT, where the judge has to think of a public figure, and the entity has to guess whom. The questions asked by the entity are rarely the same, for the participants usually think of different public figures; even when two participants think of the same public figure, it is very unlikely that the enquiry unfolds identically. The judge decides whether the entity is human and whether the enquiry is accomplished; the entity’s ‘strategicness’ is inferred from the number of questions asked.

(iv) The last test is a twenty-question yes/no Hybrid QTT, where the role of the entity is played by both human and machine. The tasks of the human are (a) to rephrase the questions asked by the machine, in order to make them more humanlike, and send them to the judge; and (b) to send the judge’s replies back to the machine. The human can also skip questions that are redundant. Again, the judge decides whether the entity is human and whether the enquiry is accomplished, and the entity’s ‘strategicness’ is inferred from the number of questions asked.

By this division of tasks, I do not implicitly claim that the bot would always need a human component to perform properly. The Hybrid QTT is intended to show that the SOSI QTT can be either a competitive or a cooperative game, which I hold to be a further advantage over the SISO (and thus competitive) TT. Each test (TT, TT2, QTT and Hybrid QTT) is divided into two procedures: one is a human-vs-human game, the other a machine-vs-human game. The results are given by the comparison between the human’s performance and the machine’s performance.

Below, I show the transcript of one of the experiments. Here it is possible to see a few examples of the different kinds of questions required during the different tests.

Figures a and b Show the transcript of one of the experiments

4.4 The Bots

I use two bots to run the tests: Cleverbot for the TT and the TT2; and Akinator for the QTT and Hybrid QTT. It’s useful to keep in mind that both Cleverbot and Akinator are G-rated games, that is, they are suitable for family gameplay and, therefore, certain elements are censored. Akinator is very good at the yes/no guessing game, but it cannot engage equally well in open-ended conversation; therefore, Akinator cannot perform convincingly in a normal TT, which is why I use Cleverbot for the TT. Using two bots, it’s worth noting, does not affect the overall significance of the experiment. My justification is that merging Cleverbot and Akinator into a single program would not be especially difficult, so using two programs to run the experiment does not imply that a machine cannot carry out both tasks: the TT conversation and the QTT enquiry.

CleverbotFootnote 39 is a chatbot developed by Rollo Carpenter, and it is designed to learn the interactions in its table from the public, during its conversations. Cleverbot, as described by its creator, “uses deep context within 180 million lines of conversation, in many languages, and that data is growing by a million a week.”Footnote 40 In 2011, during the TT competition at the Techniche 2011 festival (IIT Guwahati, India), Cleverbot achieved 59.3% against the humans’ 63.3%, on a total of 1334 votes. Cleverbot’s algorithm enables it to compare sequences of symbols against its table, which includes over 170 million items. Now, Cleverbot is not, strictly speaking, a Blockhead: a brute-force approach would not work efficiently with so many items. As the creators explain:

Attempting to search through this many rows of text using normal database techniques takes too much time and memory. Over the years, we have created several custom-designed and unique optimisations to make it work.[…] We realised that our task could be quite nicely divided into parallel sub-tasks. The first step in Cleverbot is to find a couple million loosely matching rows out of those 170 million. We usually do this with database indices and caches and all sorts of other tricks. When servers were busy, we wouldn’t use the whole 170 million rows, but only a small fraction of them. Now we can serve every request from all 170 million rows, and we can do deeper data analysis. Context is key for Cleverbot. We don’t just look at the last thing you said, but much of the conversation history. With parallel processing we can do deep context matching.Footnote 41
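
The passage suggests a divide-and-match architecture: shard the table, loosely match each shard in parallel, then re-rank the survivors by conversational context. The following is a toy sketch of that general idea (mine, under those assumptions; it is not Cleverbot’s actual code):

from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_shard(shard, user_input, threshold=0.5):
    # First pass: keep only the loosely matching rows of one shard.
    return [row for row in shard
            if similarity(row["stimulus"], user_input) > threshold]

def reply(shards, user_input, history):
    # The parallel sub-tasks: each shard is matched independently.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda s: match_shard(s, user_input), shards)
    survivors = [row for rows in results for row in rows]
    # Second pass: 'deep context' re-ranking against recent history.
    best = max(survivors,
               key=lambda row: similarity(row["context"], " ".join(history[-3:])),
               default=None)
    return best["response"] if best else "What do you mean?"

Here each row is assumed to be a dictionary with 'stimulus', 'context' and 'response' fields; the field names and the similarity measure are placeholders for whatever custom optimisations the real system uses.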

AkinatorFootnote 42 is a questioning bot developed by the French company Elokence.com. Akinator is designed to play the 20q guessing game: it has to identify the public figure the participant is thinking of by asking as few yes/no questions as possible. Examples of the yes/no questions asked by Akinator are: “Is your character alive?” or “Is your character fictional?”, and so on. Yes/no questions are useful to rule out as many objects as possible from the knowledge base of the system: ideally, every question will rule out half of the objects from the table. When Akinator picks a new question, it uses the answers received and looks for probable objects. This means that the enquiry is constantly adapting, shifting from one hypothesis to another. There are three replies available to the player: “Yes”, “No” and “Don’t Know”. From time to time, when a player gets to the end of a game, Akinator points out that there were contradictions. It can, of course, fail the enquiry, and the reason is that the system tries to reflect human knowledge, not necessarily what is objectively true. Akinator learns everything it knows from the people who play the game: it deals with opinions, not necessarily with facts. So, Akinator’s knowledge is not scientific, but generated from the social knowledge and opinions of its users. In case of wrong conclusions, it is possible to correct Akinator’s knowledge by playing the game while thinking about the same character over and over again: Akinator will eventually learn the correct outcome after a few games. And of course, if at the end of a game Akinator does not know the answer, the player has the opportunity to provide it.
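
The halving idea can be made concrete with a small sketch. Akinator’s actual algorithm is proprietary, so the following assumes a simple Boolean table of characters and attributes, with no “Don’t Know” answers and no noisy data; the best question is the attribute that splits the remaining candidates most evenly:

# Toy halving strategy for the 20q game.
TABLE = {
    "Ada Lovelace":    {"fictional": False, "female": True,  "british": True},
    "Alan Turing":     {"fictional": False, "female": False, "british": True},
    "Sherlock Holmes": {"fictional": True,  "female": False, "british": True},
    "Marie Curie":     {"fictional": False, "female": True,  "british": False},
}
ATTRIBUTES = list(next(iter(TABLE.values())))

def best_question(candidates, asked):
    # Pick the attribute whose yes/no split is closest to half/half,
    # so that either answer rules out roughly half of the candidates.
    def imbalance(attr):
        yes = sum(TABLE[c][attr] for c in candidates)
        return abs(2 * yes - len(candidates))
    return min((a for a in ATTRIBUTES if a not in asked), key=imbalance)

def play(answer):  # 'answer' maps an attribute to the judge's yes/no reply
    candidates, asked = set(TABLE), set()
    while len(candidates) > 1 and len(asked) < len(ATTRIBUTES):
        attr = best_question(candidates, asked)
        asked.add(attr)
        candidates = {c for c in candidates if TABLE[c][attr] == answer(attr)}
    return candidates

# The judge thinks of Marie Curie: identified in two questions.
print(play(lambda a: {"fictional": False, "female": True, "british": False}[a]))

With a well-balanced table, n objects need only about ⌈log₂ n⌉ questions, which is why twenty yes/no questions suffice, in principle, for over a million candidates (2^20 = 1,048,576).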

4.5 Results

Here I show the results gained after 60 experiments. The tests involved in the study have a simplified experimental design, where the TT and TT2 have a fixed length of three questions, and the QTT and Hybrid QTT are restricted to yes/no enquiries. The goal of the study is to highlight the weaknesses of the TT and show the experimental advantages of switching to the SOSI setup and parametrising the entity along other dimensions in addition to ‘human-likeness’. This not only minimises false negatives and positives (the Confederate and Eliza Effects), but also prevents uncooperative (Artificial Stupidity) and brute-force (Blockhead) approaches from passing. The study also shows that the QTT allows a third benchmark to be built, scoring not only the performance of the human and the performance of the machine, but also the performance of the hybrid entity. In the following tables, I show the data gained. It is worth noting that these results are mainly exploratory, especially the results of the Hybrid QTT, which tell us that we can build an artefact with humanlike reasoning if we use a human. This may appear trivially true, but it is an explored strategy in AI (human-in-the-loop) and Human–Computer Interaction (mixed-initiative computing). However, the results of the Hybrid QTT show that a hybrid entity can make many more errors than a machine, yet still be considered more humanlike.

Finally, the Hybrid QTT is not intended to undermine the TT, which evaluates whether a machine—and not a machine with the help of a human—can pass for a human. The Hybrid QTT is intended to highlight an important advantage of the QTT (and of SOSI tests in general): the QTT can be played either competitively or cooperatively, whereas the TT (and SISO tests in general) can be played only competitively (Tables 1, 2, 3).

Table 1 Human: 30 (total tests: 60)
Table 2 Machine: 30 (total tests: 60)
Table 3 H/M Hybrid: 60 (total tests: 60)

The following are the results of the TT, where the entity has to reply to three questions, and it is evaluated in terms of ‘human-likeness’ alone (Fig. 5)Footnote 43:

Fig. 5 Shows the results of the TT

Without any other dimension in addition to ‘human-likeness’ along which to parametrise the entity, my claim is that the TT is the most general and challenging test for intelligence, but at the same time, the most exploitable one. As the graphs show (Fig. 5), there’s a 36% chance that the TT’s outcome, when the entity is played by a human, is a false negative (Confederate Effect); and a 30% chance that the TT’s outcome, when the entity is played by a machine, is a false positive (Eliza Effect).

The following are the results of the TT2, where the entity has to solve three bias-free problems, and it is evaluated in terms of ‘human-likeness’ and ‘correctness’:

The TT2 is explicitly intended to prevent an entity from using Artificial Stupidity to exploit the test by evading any interaction whatsoever. It is not intended as a proper update of the TT, for the interactions allowed involve problems and puzzles only, and it is therefore likely to produce chauvinistic results. The point of the TT2 is to force the entity to reply to difficult questions without avoiding them, whereas the TT allows the entity to avoid difficult questions. However, it is interesting to see (Fig. 6)Footnote 44 that, in terms of ‘human-likeness’, (i) the Eliza Effect is ruled out; and (ii) the Confederate Effect is reduced to a 17% chance. In terms of ‘correctness’, the humans always give the right answer, whereas the machine never does. This can be one reason why the judge is less prone to making wrong identifications. However, being ‘correct’ does not necessarily mean being humanlike (e.g. a calculator would be easily unmasked due to its inhuman accuracy). Also, ‘correctness’ can rule out Artificial Stupidity, but it cannot prevent Blockhead from passing.

Fig. 6 Shows the results of the TT2

The following are the results of the viva voce QTT, where the entity has to accomplish a yes/no enquiry with as few humanlike questions as possible; and it is evaluated in terms of ‘human-likeness’, ‘correctness’ and ‘strategicness’:

The results of the QTT (Fig. 7)Footnote 45 show that its experimental design is able to reduce the Confederate Effect to a 6% chance and the Eliza Effect to a 20% chance. In other words, the QTT grants better control of false negatives and positives than the TT (where the chances are, respectively, 36% and 30%). This is true even though (i) the machine outscores the human in terms of ‘correctness’, the former having an 86% chance of accomplishing the enquiry against the latter’s 26% chance; and (ii) the machine outscores the human in terms of ‘strategicness’, the former needing, on average, 17 questions per enquiry, and the latter, on average, more than 20. However, the fact that the machine outscores the human in terms of ‘correctness’ and ‘strategicness’ does not mean that the machine passes the test. The reason is that, as with other systems such as calculators, providing correct and strategic answers is not sufficient for attributing intelligence. To pass the test, the entity still needs to prove its ‘human-likeness’, by means of a conversational style that can be recognised as human. The style of Akinator’s questions, in contrast, is very simple and distant, and can be generalised in the following form: “Is your character x?”, where “x” is usually an adjective (such as “real”, “alive”, “female”, etc.).

Fig. 7 Shows the results of the QTT

Finally, the following are the results of the Hybrid QTT, where the entity, played by both a human and a machine, has to accomplish a yes/no enquiry with as few humanlike questions as possible; and its performance is evaluated in terms of ‘human-likeness’, ‘correctness’ and ‘strategicness’:

As it is possible to see (Fig. 8),Footnote 46 the hybrid entity is able (i) to be recognised as human every time, (ii) to accomplish the enquiry very often (88% chance) and (iii) to ask, on average, 15 questions per enquiry. In other words, the hybrid entity outscores both the human and the machine alone, not only in terms of ‘correctness’ and ‘strategicness’, but even in terms of ‘human-likeness’ (due, I suspect, to the overall improved performance).

Fig. 8 Shows the results of the Hybrid QTT

5 Objections

Here I consider three objections to the QTT: (i) the first claims that the QTT is chauvinistic; (ii) the second claims that yes/no questions are not a proper tool for an enquiry, making the QTT too easy; and (iii) the last claims that the QTT cannot prevent Blockhead from passing.

The first objection claims that the QTT is chauvinistic, since it is intended to measure abilities that an intelligent agent may fail to show (such as the ability to be correct and strategic). I reject this view by clarifying that the QTT is intended to measure the ability of the entity to strategically accomplish the aim of an enquiry in a humanlike enough fashion. In other words, ‘humanlike’ still means ‘intelligent’, and it is possible for an agent that shows ‘correctness’ and ‘strategicness’, but not ‘human-likeness’, not to be attributed with intelligence. Conversely, it is possible for an agent that shows ‘human-likeness’, but not ‘correctness’ and ‘strategicness’, to be attributed with intelligence. It’s worth noting that, even though I argue that evaluating ‘human-likeness’, ‘correctness’ and ‘strategicness’ improves the experimental design of a conversational test for intelligence like the TT, I do not hold these dimensions to be logically necessary conditions for intelligence.

The second objection involves the choice of limiting the QTT’s enquiry to yes/no questions. The justification is provided by Hintikka (1999), who argues that any possible wh-questionFootnote 47 can be reduced to a series of yes/no questions:

THEOREM 2 (Yes–No Theorem). In the extended interrogative logic, if M: T ⊢ C, then the same conclusion C can be established by using only yes–no questions. A terminological explanation is in order here. For propositional question “Is it the case that S1 or... or Sn ?” the presupposition is (S1 ∨… ∨ Sn). We say that a propositional question whose presupposition is of the form (S ∨ ~ S) is yes–no question. (302)Footnote 48

In agreement with Hintikka about the reducibility of any question to a series of yes/no questions, Genot and Jacot (2012) remark that:

The special case of yes-or-no questions is of interest because: (a) their presuppositions are instances of the excluded middle, so they can always be used in an interrogative game; and: (b) the inferential role played by arbitrary questions [wh-questions] can always be played by yes-or-no questions (194)
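
The reducibility claim also bears directly on ‘strategicness’. As a simple illustration (mine, not Hintikka’s or Genot and Jacot’s): if the answer to a wh-question is known to lie among n mutually exclusive alternatives, then repeatedly asking whether it lies in a chosen half of the remaining alternatives settles the question in ⌈log₂ n⌉ yes/no questions. Since 2^20 = 1,048,576, the QTT’s twenty questions are, in principle, enough to single out one candidate among over a million, provided each question roughly halves the field (this is the halving strategy sketched in Sect. 4.4).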

The last objection claims that the QTT cannot prevent Blockhead from passing, for Blockhead would be able to ask any question whatsoever. My reply is that it is true that a questioning Blockhead would be able to ask any question and accomplish any enquiry. However, it would take too many random questions to accomplish any enquiry. In other words, Blockhead would fail the QTT because of its slowness and randomness. We can hypothesise a modified Blockhead that keeps track of the answers received, asks itself “What is the next best question to ask given that I have already learned x and y?”, and then asks its question. However, in order to ask questions in such a strategic and optimised way, Blockhead would need an information-gathering algorithm to optimise the search through its table (see Fig. 9). With such an algorithm, it is worth noting, Blockhead would not be a simple look-up table anymore, and could indeed be considered intelligent. Lacking such an algorithm, Blockhead would have no better way to ask a new question than by randomly picking one.Footnote 49 This costs Blockhead both its ‘strategicness’ and its ‘human-likeness’, for a human would not normally ask completely random or pointless questions (Fig. 9Footnote 50).

Fig. 9 Shows the design of an information-seeking algorithm
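
One standard way to make ‘the next best question’ precise (my formulation, not the paper’s) is expected information gain: given a probability distribution over the remaining candidate set C, the algorithm picks the question that maximises

H(C) − P(yes)·H(C | yes) − P(no)·H(C | no),

where H is the Shannon entropy over the candidates. Under a uniform distribution, this reduces to the halving heuristic of Sect. 4.4; and it is precisely this kind of information-gathering algorithm that a bare look-up table, by definition, lacks.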

In what follows, I discuss the hypothetical outcomes of both the viva voce and the unrestricted QTT when played by: (i) Blockhead, (ii) Expert Blockhead, (iii) Stupid Blockhead and (iv) Learning Blockhead (see Tables 4 and 5). It should be kept in mind that these outcomes are purely speculative, for no experiments have been run to prove such results.

(i) Blockhead (see Sect. 3.2) would fail both the viva voce and the unrestricted QTT by showing only ‘correctness’ in accomplishing the aim of the enquiry, but not ‘human-likeness’ (due to its slowness and randomness) or ‘strategicness’ (due to the lack of an information-gathering algorithm).

(ii) Expert Blockhead (see Sect. 3.2.1) might be able to pass the viva voce QTT, since it could show both ‘correctness’ and ‘strategicness’ by accomplishing a specific enquiry strategically (depending on the variety of questions programmed and the difficulty of the task set by the judge); it might even do so in a humanlike fashion, but it is unlikely to replicate this success across a series of tests, thus inductively failing the QTT. Expert Blockhead would fail the unrestricted QTT, showing ‘human-likeness’, ‘correctness’ and ‘strategicness’ in no enquiry except the one in which it is an expert (an example of Expert Blockhead is Akinator).

(iii) Stupid Blockhead (see Sect. 3.2.2) would fail both the viva voce and the unrestricted QTT, since it would ask uncooperative and non-strategic questions, thus failing to accomplish any enquiry whatsoever (just as Artificial Stupidity cannot accomplish any task in the TT other than evading the conversation). Moreover, due to the randomness of its questions, Stupid Blockhead would hardly be attributed with ‘strategicness’ or ‘human-likeness’.

(iv) Learning Blockhead (see Sect. 3.2.3), just like Blockhead, would fail both the viva voce and the unrestricted QTT by accomplishing the aim of the enquiry correctly, but neither in a humanlike way (due to its slowness and randomness) nor strategically (due to the lack of an information-gathering algorithm).

Table 4 Viva voce QTT (yes/no enquiry)
Table 5 Unrestricted QTT (open enquiry)

So, my claim is that Blockhead cannot be avoided in a SISO test, but it can be avoided in a SOSI test. The reason is that, in a SOSI test, even though Blockhead would eventually be able to accomplish an enquiry, it would take too many questions and too much time, failing ‘strategicness’ and ‘human-likeness’, and thus failing the test.

6 Conclusions

In this paper, I propose the QTT in order to improve the experimental design of the TT. In the QTT, where the SISO setup is switched to the SOSI setup, the entity has to accomplish the aim of an enquiry with as few humanlike questions as possible. The QTT has the advantage of parametrising the entity along two further dimensions in addition to ‘human-likeness’: ‘correctness’ (which evaluates the ability to accomplish the aim of an enquiry) and ‘strategicness’ (which evaluates the ability to do so with as few questions as possible). My claim is that the QTT minimises both the Eliza Effect and the Confederate Effect, and prevents Artificial Stupidity and Blockhead from passing. In other words, the QTT avoids false negatives and positives, and prevents uncooperative and brute-force approaches from passing. In support of this, I discuss my study and the results gained.